Environmental scanning is the process an organization uses to collect, analyze and
use  information.  With  the  availability  of  vast  quantities  of  information  on  the  Internet,  an 
organization  has  a  great  need  for  an  automated  methodology  to  scan  and  use  this 
information.  Additionally,  the  information  available  via  the  Internet  is  mostly  text-based. 
Hence,  the  automated  scanning  methodology  developed  in  this  research  uses  the  well- 
founded  vector  space  model  (VSM)  to  represent  the  documents  available  via  the  Internet 
and  linear  discriminant  analysis  to  classify  the  documents.  Chapter  1  of  this  dissertation 
provides  an  introduction  to  the  environmental  scanning  problem  in  light  of  the  current 
problem  of  gathering  and  using  the  vast  quantity  of  information  available  via  the  Internet. 
Chapter  2  provides  a  review  of  the  literature  related  to  environmental  scanning,  the 
proposed  methodology  for  solving  the  problem  of  developing  an  automated  scanning 
process and the environment used to empirically test the methodology developed.
Chapter 3 describes the details of the methodology developed in this dissertation
and  the  application  environment  used  to  empirically  test  the  methodology.  The 
methodology  is  tested  by  collecting  news  documents  available  via  the  Internet  about 
publicly  traded  companies.  Chapter  4  has  additional  details  on  the  scanning  process  as 
well  as  a  description  of  the  experimental  design  used  to  empirically  test  the  scanning 
process. The experimental design involves testing both a training set and a holdout
sample for correct classification results. Chapter 5 presents the results. Finally,
Chapter 6 provides a summary, a conclusion and directions for future research.

CHAPTER 1
INTRODUCTION 

Information is vital to an organization's survival. The amount of money spent on
gathering intelligence by organizations is in the range of $50,000 to $1.5 million
(Badeian  1986).  Because  of  the  Internet,  the  availability  of  useful  information  for  a 
corporation  is  growing  at  a  fast  pace.  The  amount  of  information  stored  on  the  Internet 
has  been  estimated  to  double  or  more  than  double  every  18  months  (Yang  et  al.  2000). 
With  all  of  this  information  available,  companies  must  be  able  to  obtain  relevant 
documents  and  mine  them  for  useful  strategic  decision-making  information  to  gain 
competitive  advantage. 

Drucker  (1998)  claims  that  information  technology  (IT)  has  had  little  impact  on 
making  strategic  decisions  such  as  whether  or  not  to  enter  a  new  market  or  build  a  new 
office  building.  According  to  a  recent  article  by  Denton  (2001),  organizations  and 
individuals  within  the  organization  are  drowning  in  too  much  information.  However, 
Denton  (2001)  also  claims  that  the  Internet  has  the  capability  of  improving  knowledge 
management  and  should  be  used  to  simplify  decision  making  by  concentrating  on  critical 
information. 

Scanning  an  organization's  internal  and  external  environment  first  became 
popular  in  the  1960s  when  scanning  became  a  component  of  strategic  planning  (Russell 
and  Prince  1992).  Environmental  scanning  is  still  popular  today.  Ghoshal  and  Westney 
(1991) reported, based on a 1985 survey, that over one-third of a sample of Fortune 500
companies were spending over $1 million a year on competitive analysis. Additionally,
Subramanian  et  al.  (1993)  reported  that  firms  with  environmental  scanning  systems  to 
monitor  the  external  environment  had  higher  growth  and  profitability  than  firms  that  did 
not.  General  Electric,  IBM,  J.  P.  Morgan,  Merrill  Lynch,  Motorola,  Schlumberger  and 
Xerox  already  use  the  Internet  for  information  gathering,  according  to  Pawar  and  Sharda 
(1997).  Their  article  claimed  that  environmental  scanning  activities  are  in  high  demand. 
Additionally, they said that the availability of environmental scanning or business
intelligence  systems  that  meet  company  needs  is  inadequate. 

Presently,  one  type  of  system  used  for  scanning  the  environment,  as  well  as  other 
functions, is an executive information system (EIS). According to Moad (1988), many
small  firms  and  70%  of  large  firms  currently  use  or  are  considering  EIS.  According  to 
Bajwa  et  al.  (1998),  EIS  is  growing  rapidly.  Several  EIS  software  packages  are  available 
including  ACE  Reports,  COR  Technology,  e.Reporting  Suite  and  eMis  Executive 
Information Systems. However, past literature has reported failure rates between 40% and
70% for EIS (Raths 1989, Watson et al. 1991).

As  the  previous  discussion  suggests,  there  are  several  key  reasons  why  an 
environmental  scanning  process  for  collecting,  analyzing  and  interpreting  information 
available  on  the  Internet  is  desirable,  including  the  importance  of  information  to  a 
corporation,  the  vast  amount  of  information  available  via  the  Internet,  the  inability  of 
many  systems  to  process  information  for  decision-making  purposes,  the  link  between 
profitability  and  external  scanning  systems  and  the  lack  of  availability  of  scanning 
systems  that  meet  companies'  needs.  Developing  and  testing  a  web-based  scanning 
process  is  the  essence  of  the  research  in  this  dissertation. 


The  process  developed  involves  four  main  steps:  (1)  collecting  web  documents, 
(2)  representing  the  documents  collected  via  the  popular  vector  space  model  (VSM) 
representation  (Salton  1968),  (3)  separating  a  training  set  of  documents  using  linear 
discriminant  analysis  (LDA)  and  (4)  using  the  linear  discriminant  function  determined  by 
LDA  to  classify  new  documents  according  to  their  signal.  Our  process  combines  two 
well-established  areas  of  research:  the  vector  space  model  (VSM)  for  information 
retrieval developed by Salton (1968) and linear discriminant analysis first introduced by
Fisher  (1936)  to  scan  web  documents  for  signals  that  can  aid  in  an  organization's  ability 
to  make  decisions. 
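
As an illustration only, steps (2) through (4) can be sketched in a few lines of Python. The vocabulary, the sample documents, the raw term-count weighting, the small ridge term and the midpoint cut-off are all assumptions made for this sketch; they are not taken from the study, and the sketch is not the implementation used in this dissertation.

```python
from collections import Counter

def vectorize(text, vocabulary):
    """Step 2: represent a document as a vector of raw term counts (assumed weighting)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def scatter(vectors, centre):
    """Sum of outer products of the deviations from the class mean."""
    dim = len(centre)
    s = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        dev = [a - b for a, b in zip(v, centre)]
        for i in range(dim):
            for j in range(dim):
                s[i][j] += dev[i] * dev[j]
    return s

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_lda(class0, class1, ridge=1e-3):
    """Step 3: Fisher's two-class discriminant, w = S_w^{-1} (m1 - m0)."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    dim = len(m0)
    # Assumption: a small ridge keeps the pooled scatter matrix invertible on tiny samples.
    sw = [[s0[i][j] + s1[i][j] + (ridge if i == j else 0.0) for j in range(dim)]
          for i in range(dim)]
    w = solve(sw, [b - a for a, b in zip(m0, m1)])
    # Assumption: the cut-off lies halfway between the projected class means.
    cutoff = sum(wi * (a + b) / 2 for wi, a, b in zip(w, m0, m1))
    return w, cutoff

def classify(vector, w, cutoff):
    """Step 4: assign class 1 when the discriminant score exceeds the cut-off."""
    return 1 if sum(wi * xi for wi, xi in zip(w, vector)) > cutoff else 0

# Hypothetical vocabulary and training documents (class 1 = positive signal).
vocabulary = ["profit", "growth", "loss", "lawsuit"]
positive = ["profit growth profit", "growth growth profit", "profit growth"]
negative = ["loss lawsuit loss", "lawsuit loss", "loss loss lawsuit lawsuit"]

w, cutoff = fit_lda([vectorize(d, vocabulary) for d in negative],
                    [vectorize(d, vocabulary) for d in positive])
new_label = classify(vectorize("profit growth despite lawsuit", vocabulary), w, cutoff)
```

Step (1), collecting the web documents, is omitted here because it depends on the sources being scanned.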

1.1  Research  Problem 

As  linear  discriminant  analysis  is  the  technique  used  to  classify  documents 
according to their signals, one important problem is to determine how well LDA
classifies  the  documents.  Another  problem  is  the  degree  to  which  new  documents  can  be 
used  for  future  classification.  Both  of  these  problems  are  investigated  in  this  dissertation. 
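
One common way to quantify how well a discriminant function classifies is the proportion of documents assigned to their known class, computed separately for the training set and for a holdout sample. The following minimal sketch shows that calculation; the scores, labels and the cut-off of zero are hypothetical values chosen for illustration and are not results from this study.

```python
def hit_rate(scores, labels, cutoff):
    """Fraction of documents whose predicted class (score > cutoff) matches the known label."""
    predictions = [1 if s > cutoff else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical discriminant scores and true labels, for illustration only.
training_scores, training_labels = [2.1, 1.7, -0.4, -1.9], [1, 1, 0, 0]
holdout_scores, holdout_labels = [1.2, -0.3, 0.6, -2.2], [1, 1, 0, 0]

training_rate = hit_rate(training_scores, training_labels, cutoff=0.0)
holdout_rate = hit_rate(holdout_scores, holdout_labels, cutoff=0.0)
```

A holdout rate well below the training rate would suggest that the discriminant function does not generalize to new documents.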

1.2  Purpose 

The  need  for  better  tools  for  environmental  scanning  coupled  with  the  availability 
of  easily  accessible  information  on  the  Internet  presents  a  real  need  for  an  automated 
scanning tool that detects signals based on information gathered from the web.
Additionally,  the  ability  to  use  this  tool  to  quantify  and  classify  information  for  use  by  a 
business  is  important.  Therefore,  the  purpose  of  this  dissertation  is  to  provide  a 
framework  for  how  a  company  can  use  the  Internet  to  obtain  external  information  about 
signals  from  the  environment  to  help  make  better  decisions.  A  process  for  automated 
web-based environmental scanning was developed and empirically tested in this study.

1.3 Organization

Chapter  2  provides  a  literature  review  of  several  areas  related  to  the  process 
developed  in  this  dissertation.  These  areas  are  environmental  scanning,  the  VSM  for  IR, 
linear  discriminant  analysis,  equity  investments  in  finance  and  text  classification 
problems  in  computer  science.  Chapter  3  provides  a  detailed  discussion  of  the  problem 
setting,  the  application  environment  and  research  questions  for  this  dissertation. 
Additionally,  Chapter  3  provides  the  details  of  the  VSM  representation  and  the  linear 
discriminant  analysis  problem.  Chapter  4  has  additional  details  on  the  environmental 
scanning  process  and  discusses  the  experimental  design.  Chapter  5  provides  the  results 
of  our  empirical  analysis  of  the  process.  Chapter  6  provides  a  summary  of  our  findings, 
conclusions and directions for future research.


CHAPTER  2 
LITERATURE  REVIEW 

In  this  chapter  we  review  the  literature  related  to  this  study  of  the  automation  of 
environmental  scanning  of  the  World  Wide  Web  (WWW).  Environmental  scanning  is 
the  process  of  obtaining  and  using  information  from  an  organization's  external 
environment  to  assist  in  decision-making  (Aguilar  1967,  Choo  and  Auster  1993).  Lester 
and  Waters  (1989)  said  environmental  scanning  is  a  process  that  uses  information  from 
the  environment  to  aid  management  in  decision-making.  There  are  three  key  components 
to this process. The first is obtaining the information, the second analyzing the
information  and  the  third  using  the  information  for  making  decisions  in  an  organization 
(Lester  and  Waters  1989).  According  to  Daft  and  Weick  (1984),  the  way  an  organization 
deciphers  its  environment  in  order  to  learn  from  it  may  be  divided  into  three  phases: 
scanning,  interpretation,  and  learning.  Scanning  involves  information  gathering. 
Interpretation  involves  giving  meaning  to  the  data.  Learning  involves  taking  action  based 
on  the  data.  All  of  these  definitions  have  the  common  components  of  information 
gathering,  analysis  and  use. 

Scanning  is  a  form  of  "organizational  browsing."  The  benefits  of  scanning 
include  finding  the  information  desired  by  the  organization,  better  decision  making  due  to 
the  use  of  information  found  during  scanning,  updating  initial  information  requirements 
due to information discoveries, and the discovery of useful information found quite by
accident. Scanning that is managed poorly can result in too much information that is
confusing. This can waste employees' time and energy, leading to high costs
with little or no action taken as a result of scanning (Choo 1995).

This  study  proposes  methodology  to  automate  the  environmental  scanning 
process  through  the  use  of  web-based  methods.  The  automated  process  combines  two 
techniques:  the  vector  space  model  for  information  retrieval  introduced  by  Salton  (1968) 
and  discriminant  analysis  techniques  originally  introduced  by  Fisher  (1936).  The  two 
techniques combined are used to solve a text classification problem. The
information being analyzed consists of news articles about stocks in the stock market. The chapter
is organized as follows. Section 2.1 provides a literature review of environmental scanning
and executive information systems (EIS). Section 2.2 is a review of the vector space model
literature. Discriminant analysis is discussed in Section 2.3. Related work in finance is
presented in Section 2.4. Finally, a brief overview of literature on the text classification
problem is provided in Section 2.5.

2.1  Environmental  Scanning 

Environmental  forces  that  affect  an  organization  can  be  classified  into  six 
categories:  demographics,  economics,  social  and  cultural,  political  and  legal, 
technological,  and  competitive.  Demographic  forces  involve  the  characteristics  of  the 
population being studied such as birthrate, family structure, divorce rate and education.
Economic  forces  include  inflation  rate,  spending  patterns,  income  levels  and  purchasing 
power.  Social  and  cultural  forces  that  affect  an  organization  involve  the  set  of  values, 
ideas  and  attitudes  held  by  various  cultures  or  social  groups  within  the  targeted  consumer 
population.  These  attitudes  can  affect  buying  patterns  of  the  population  of  consumers. 
Political and legal forces define the limits within which an organization can operate. The
federal government imposes regulations and policies in areas such as airline overbooking,
food labels, prescription drugs, nursing homes and import tariffs (Michman 1983).
Technological  forces  can  affect  an  organization.  Changing  technology  can  dramatically 
change  consumer  markets  (Michman  1983).  Finally,  competitive  forces  continually 
challenge  organizations  (Michman  1983).  These  forces  are  interrelated,  yet  beyond  an 
organization's  control. 

Aguilar  (1967)  identifies  four  methods  of  environmental  scanning:  undirected 
viewing,  conditioned  viewing,  informal  search  and  formal  search.  In  undirected  viewing, 
scanning  is  done  incidentally  and  without  purpose.  The  manager  is  not  prepared  to 
analyze  the  information.  Conditioned  viewing  involves  exposure  to  selected  information 
without  searching  for  it.  The  manager  is  prepared  to  analyze  or  interpret  the  information. 
In  informal  search,  information  is  actively  sought  without  a  structured  search  strategy. 
Formal  search  involves  a  manager  actively  seeking  information  using  a  formal  search 
method. 

Choo  (1995)  offered  a  different  perspective  on  the  types  of  scanning.  Choo,  like 
Aguilar,  recognized  four  scanning  methods.  The  first  two,  undirected  and  conditioned 
viewing,  were  first  proposed  by  Aguilar  (1967).  The  other  two  methods  are  enacting  and 
discovery.  Daft  and  Weick  (1984)  proposed  that  the  type  of  scanning  depends  on  how 
management  perceives  the  analyzability  of  the  external  environment  and  depends  on  the 
intrusiveness of the organization. If management perceives the environment as highly
analyzable,  they  will  participate  in  more  structured  environmental  scanning.  Intrusiveness 
into  the  environment  is  determined  by  how  willing  an  organization  is  to  intrude  or 
interfere with the external environment in order to understand it. An organization is
considered active or passive in terms of its organizational intrusiveness. An actively
intrusive organization would seek out information while a passively intrusive
organization would attempt to understand the external environment based on any
information  that  happened  to  be  provided  without  actively  seeking  that  information  out. 

Based  on  the  two  dimensions  described,  organizational  intrusiveness  and 
analyzability  of  the  environment,  Choo  (1995)  proposed  the  scanning-interpretation 
model.  According  to  the  model,  a  passively  intrusive  organization  that  perceives  the 
environment  as  unanalyzable  will  participate  in  undirected  viewing.  A  passively 
intrusive  organization  that  perceives  the  environment  as  analyzable  will  participate  in 
conditioned  viewing.  An  actively  intrusive  organization  that  perceives  the  environment 
as  unanalyzable  will  participate  in  enacting.  Organizations  participating  in  enacting  have 
a  need  to  learn  by  doing,  to  do  experimentation,  often  on  the  environment,  and  to  seek 
information  in  the  form  of  feedback  from  sources  about  the  organization's  activities.  An 
actively  intrusive  organization  that  perceives  the  environment  to  be  analyzable  will 
participate  in  discovery.  Discovery  involves  formal  information  needs  that  are  satisfied 
through  many  sources  through  surveys  and  market  research.  "In  summary,  the 
scanning-interpretation  model  appears  to  be  a  viable  framework  for  analyzing  the 
primary  environmental  and  organizational  contingencies  that  influence  environmental 
scanning  as  cycles  of  information  seeking  and  information  use  activities."  (Choo  1995, 
p.85) 
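
The four cases described above amount to a two-by-two lookup on the dimensions of intrusiveness and analyzability. The short sketch below restates that mapping as a table; the function and variable names are illustrative and do not come from Choo (1995).

```python
# Scanning mode by (organizational intrusiveness, perceived analyzability),
# as summarized in the preceding paragraph.
SCANNING_MODE = {
    ("passive", "unanalyzable"): "undirected viewing",
    ("passive", "analyzable"): "conditioned viewing",
    ("active", "unanalyzable"): "enacting",
    ("active", "analyzable"): "discovery",
}

def scanning_mode(intrusiveness, analyzability):
    """Return the scanning mode the model predicts for the given combination."""
    return SCANNING_MODE[(intrusiveness, analyzability)]

example = scanning_mode("active", "analyzable")
```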

Michman  (1983)  focuses  on  monitoring  the  changes  that  take  place  in  the 
environment  surrounding  an  organization  and  planning  to  make  adjustments  in  the 
marketing strategy of an organization based on information obtained from monitoring.
Seven techniques are mentioned that were adapted to scanning the environment: trend
extrapolation,  Delphi  technique,  cross-impact  analysis,  simulation  models,  barometric 
forecasts,  trend  impact  analysis,  and  multiple  scenarios.  Michman  (1983)  identified 
conditions,  current  in  1983,  of  demographic  trends  and  their  implications,  the  economic 
dimension  of  the  environment,  the  social  dimension,  the  political/legal  dimension,  the 
technological  dimension,  and  the  competitive  dimension.  A  discussion  of  monitoring 
each  of  these  dimensions  and  monitoring  ecological  changes  is  provided. 

Jonsen (1986) considered environmental scanning by universities. According to
Jonsen  (1986),  environmental  scanning  is  extremely  important  to  the  success  of 
institutions  of  higher  learning,  and  that  importance  is  not  yet  realized.  Although  many 
university  systems  have  environmental  scanning  structures  already  in  place,  these 
structures  are  not  using  all  information  available  to  the  university.  University  systems 
typically  scan  for  information  about  demographic  and  economic  environments,  ignoring 
the  political,  organizational  (or  competitive),  technological  and  social  aspects  of  the 
environment. Additionally, Jonsen (1986) proposes that even if universities have all six
facets of information available, they do not have a system in place to understand the
environment. They still need a way to understand and integrate the information.
Universities  need  to  be  able  to  make  decisions  and  plan  based  on  the  understanding  and 
integration  of  the  information.  Also,  universities  need  to  make  scanning  of  all  facets  of 
the  environment,  and  using  that  information,  a  high  priority. 

Pawar  and  Sharda  (1997)  suggested  using  the  Internet  and  its  various  utilities  to 
scan  the  environment  for  external  information.  Their  article  classified  the  ability  of  the 
Internet to provide certain classes of information content. Additionally, the article
provided insight into the various types of search done over the Internet along with the
Internet  utilities  best  suited  for  those  types  of  search.  The  types  of  search,  originally 
discussed  in  Aguilar  (1967),  are  undirected  viewing,  conditioned  viewing,  informal 
search  and  formal  search.  In  the  paper  by  Pawar  and  Sharda  (1997)  the  various  Internet 
utilities  are  classified  as  low,  medium  or  high  in  terms  of  their  ability  to  do  the  four 
different  modes  of  search. 

Camillus  and  Datta  (1991)  suggested  that  there  are  different  patterns  of  acquiring 
information  depending  on  the  system  of  strategic  planning  used:  strategic  planning 
systems  (SPS)  and  strategic  issues  management  systems  (SIMS).  SPS  focuses  on 
information  that  is  directly  related  to  the  organization  and  uses  directed  environmental 
scanning.  SIMS  monitors  the  environment  continuously,  looking  for  all  signals,  even 
signals  that  might  be  weak.  Therefore,  SIMS  is  more  likely  to  use  continuous  undirected 
or  semi-directed  environmental  scanning  processes.  Additionally,  their  article  (Camillus 
and  Datta  1991)  suggested  combining  the  systems  for  a  semi-directed,  continuous 
approach  to  environmental  scanning. 

Subramanian  et  al.  (1993)  conducted  a  survey  to  assess  environmental  scanning 
activities in Fortune 500 corporations. They used a method of classification from Jain
(1984).  Organizations  can  be  classified  over  time  as  primitive,  ad  hoc,  reactive  or 
proactive in their scanning activities. Of the 101 firms that responded
to the survey, 10% operated in the primitive scanning mode, 30% operated in the ad
hoc mode, 35% had reactive systems and 25% had proactive systems. Reactive and
proactive scanning systems are considered advanced scanning activities.

Elofson et al. (1997) proposed a system architecture for sharing knowledge. The
architecture  consists  of  several  nodes  connected  by  a  network.  At  each  node  there  is  a 
manager,  a  scheduler,  a  server,  a  planner,  knowledge  sources,  one  or  more  intelligent 
agents, and relevant data structures. The article discusses the use of multiple intelligent
agents to make decisions. The authors' framework provides for the distribution of
knowledge  across  the  organization,  allowing  for  any  division  of  the  organization  to  gain 
access  to  the  knowledge  as  needed.  The  decision  support  system  architecture  described 
in  the  article  increases  the  ability  of  the  organization  to  maintain  a  memory  of,  and  to 
learn  from,  its  decisions. 

Walstrom  and  Wilson  (1997)  conducted  a  study  based  on  information  collected 
from  98  of  the  Corporate  1000  CEOs.  The  CEOs  were  asked  how  they  used  their  EIS. 
The  study  was  conducted  to  determine  the  types  of  EIS  users  and  to  classify  how  the 
different  types  used  their  EIS.  Based  on  their  study,  they  found  three  types  of  users  and 
three  underlying  dimensions  of  usage.  The  three  types  of  users  were  'converts,' 
'pacesetters'  and  'analyzers.'  'Converts'  use  the  EIS  to  increase  their  ability  to  access 
information.  'Pacesetters'  use  EIS  to  increase  their  ability  to  communicate  and  monitor 
performance.  'Analyzers'  use  EIS  to  solve  problems.  Walstrom  and  Wilson  (1997) 
performed  factor  analysis  to  determine  different  types  of  uses  of  the  EIS.  The  first,  called 
'organizational  monitoring,'  deals  with  using  the  EIS  to  monitor  email,  company  news 
and  organizational  data.  'Information  access'  is  the  second  dimension  of  use  discovered 
by  the  authors.  This  second  type  is  the  most  relevant  use  of  the  EIS  to  this  paper. 
'Information  access'  involves  the  use  of  the  system  to  gain  access  to  information  both 
internal and external to the organization. The third dimension is called 'organizational
understanding' and involves using the system to help users understand the organization
through "... unstructured access and analysis of organizational data" (Walstrom and
Wilson 1997, p. 81). The authors then classify the value of each type of dimension to
each identified type of user.

Koh  and  Watson  (1998)  conducted  a  study  to  examine  data  management  issues  in 
EIS. Based on their research, a set of seven data management issues was identified. The
authors then chose to study three in greater detail: data security, data ownership and data
standards. They found that EIS users considered data standards to be the most important
issue and also the most challenging. This is attributed to the fact that in EIS data come
from different departments and levels of management. They tested several hypotheses, of
which two were significant. There was correlation between the difficulty of data
management and the breadth and depth of the data. Correlation also existed between the
level of support from individuals and data management difficulty.

Bajwa et al. (1998) conducted a study to identify factors in the successful
implementation of executive information systems. The study examined information
system support, vendor/consultant support, and management support to determine their
influence on the success of EIS. Their findings were based on data collected from
sixty-nine firms. The authors found that IS support influences EIS success. No relation was
found  between  management  support  or  vendor/consultant  support  and  EIS  success.  This 
finding  is  contrary  to  the  literature.  Additionally  the  authors  found  that  management 
support  influences  IS  support  and  vendor/consultant  support. 

In  "Information  Management  for  the  Intelligent  Organization,"  Choo  (1995) 
conducted a review of pre-1995 literature on environmental scanning. The literature is
reviewed to study "... the effect of situational dimensions, organizational strategies,
information  needs,  and  personal  traits  on  scanning  behavior"  (Choo  1995,  p.  86).  Choo 
(1995)  grouped  information  needs  into  the  following  research  categories:  information 
needs  as  the  focus  of  environmental  scanning,  information-seeking  use  and  preferences, 
information  seeking  through  scanning  methods,  and  information  use.  The  literature  in 
each  category  was  reviewed. 

The first category of research, information needs as the focus of environmental
scanning,  focused  on  identifying  the  environmental  sectors  that  are  the  primary  focus  of 
environmental  scanning  activity.  Choo  (1995)  found  that  the  literature  (Aguilar  1967, 
Nishi  et  al.  1982,  Ghoshal  1988,  Johnson  and  Kuehn  1987,  Lester  and  Waters  1989, 
Auster  and  Choo  1993a,  Auster  and  Choo  1993b,  Choo  1993,  Olsen  et  al.  1994,  Jain 
1984)  suggests  that  organizations  consider  information  about  customers,  competitors  and 
suppliers  as  most  important. 

The  next  category,  information  seeking  use  and  preferences,  relates  to  research 
that  identifies  and  classifies  sources  of  information.  Information  sources  are  classified  as 
either  internal  or  external  to  the  organization  and  as  personal  or  impersonal  (Choo  1995). 
Personal  sources  convey  information  directly  to  the  manager.  Examples  of  impersonal 
sources  are  online  databases,  company  libraries  and  publications  (Choo  1995).  Choo 
(1995) found that the literature (Aguilar 1967, Keegan 1967, Keegan 1974, O'Connell
and Zimmerman 1979, Kobrin et al. 1980, Smeltzer et al. 1988, Lester and Waters 1989,
Gates 1990, Mayberry 1991, Culnan 1983, Ghoshal and Kim 1986, Auster and Choo
1992,  Auster  and  Choo  1993a,  Auster  and  Choo  1993b)  indicates  that  managers  use  both 
personal and impersonal sources as well as internal and external sources. Managers
prefer personal sources, especially when gathering information about customers,
suppliers  or  competitors. 

The  third  category  of  research,  information-seeking  scanning  methods,  was 
directed at classifying methods of environmental scanning and identifying the factors that
affect  the  choice  of  scanning  method  (Choo  1995).  Choo  (1995)  concluded  that,  based 
on  the  literature  (Aguilar  1967,  Keegan  1974,  Thomas  1980,  Klein  and  Linneman  1984, 
Preble  et  al.  1988,  Subramanian  et  al.  1993,  Wilson  and  Masser  1983,  Al-Hamad  1988, 
McIntyre 1992, Fahey and King 1977), many factors affect the method of scanning an
organization chooses to use including its size, industry category, environmental
perception, dependence on the environment, experience in scanning, and experience with
strategic  planning. 

The  final  category  of  research,  information  use,  focuses  on  how  the  information 
obtained from scanning is used (Choo 1995). Many studies support the claim that environmental
scanning improves an organization's performance (Miller and Friesen 1977, Newgren et
al. 1984, Dollinger 1984, West 1988, Daft et al. 1988, Subramanian et al. 1993,
Subramanian et al. 1994, Murphy 1987, Ptaszynski 1989). Choo (1995) summarized that
". . .  environmental  scanning  is  increasingly  being  used  to  drive  the  strategic  planning 
processes  by  business  and  public  sector  organizations  in  most  developed  countries." 
(Choo  1995,  p.  101)  Additionally,  there  is  a  link  between  environmental  scanning  and 
how  an  organization  performs. 

2.2  Vector  Space  Model 

Retrieving relevant information from a set of documents is not a new problem.
As long as researchers have done research, techniques for finding information relevant to
that research have been developed and fine-tuned. Information retrieval is one of the
critical areas in library science. Matching documents to users' queries has been the focus
of much research. The problem of information retrieval still exists today, with a new
twist. Modern information retrieval techniques involve searching the World Wide Web
(WWW) for documents relevant to a user's query. The vector space model, one solution
to retrieving relevant documents based on a user's query developed many years ago, is
still  applicable  to  today's  information  retrieval  problem  on  the  WWW. 

In information retrieval, there is a set of documents and a user query. Based on
the query issued, information retrieval schemes are used to return the most relevant
documents from the collection to the user. Since relevancy is a judgment made by the
user, information retrieval methods cannot be expected to return all and only the relevant
documents. Therefore it is necessary to have a method of ranking the documents in terms
of  their  similarity  to  the  user's  query.  The  vector  space  model  (VSM)  is  a  method  of 
information  retrieval  that  is  used  to  retrieve  and  rank  documents  (Salton  1968). 

The VSM, proposed by Salton (1968), involves representing a document by a
vector of terms that are related to keywords or index terms in the document, as well as
weights indicating the importance of these terms in indicating the content of the
document. There are n index terms in a collection of k documents. The VSM notation
is given in Table 1. Each index term corresponds to a vector of unit length. Let $t_1, \ldots, t_n$
be the vectors corresponding to the n index terms. In the basic model, the term vectors
are assumed to be uncorrelated. This simplifying assumption results in pairwise
orthogonality of the term vectors in the basic VSM. Document r is represented by vector
$d_r$, such that each element $d_{ir}$ of the document vector represents the weight of term i in
document r.

$$d_r = \sum_{i=1}^{n} d_{ir} t_i$$

When the terms are orthogonal, the elements of $d_r$ are the term weights, which can be
confirmed by $t_i' d_r = d_{ir}$. For a collection of k documents, an n × k term-by-document
matrix D can be constructed with the document vectors, where each column of the
document matrix corresponds to a document vector $d_r$.


Table 1. Vector space model (VSM) notation

Variable     Meaning
k            Number of documents
n            Number of index terms
$t_i$        n x 1 term vector representing term i
T            n x n term matrix where the $t_i$'s are the columns
$d_r$        n x 1 document vector of n index terms
D            n x k matrix where the $d_r$'s are the columns
$a_{i,r}$    Weight of term i in document r
q            n x 1 query vector where $q_i$ represents the weight of term i in query q

In the VSM (Salton 1989), a vector $q' = (q_1, \ldots, q_n)$ is used to represent a user query. The vector q is then matched against the document vectors to determine which documents are most similar to the query. To rank the documents according to similarity, a similarity measure is used. Several similarity measures have been suggested, including the commonly used scalar product between the query vector q and document vectors. The scalar product between two vectors x and y is defined as

$$x'y = |x||y|\cos\theta$$

where |x| is the length of the vector x and $\theta$ is the angle between the vectors x and y. The document-query similarity can be computed by

$$d_r \cdot q = \sum_{i,j=1}^{n} a_{i,r}\, q_j\, t_i \cdot t_j$$

where $a_{i,r}$ is the weight of term i in document $d_r$.

The term-term correlation, $t_i \cdot t_j$, is not usually known a priori and in the basic model, the terms are assumed to be uncorrelated and hence, orthogonal. Therefore, in the basic model the term-term correlation is given by

$$t_i \cdot t_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases}$$

This results in the orthogonality of the term vectors, creating a linearly independent set of n vectors. The term vectors $t_1, \ldots, t_n$ are therefore a basis for the document space. When the term vectors are assumed to be uncorrelated, the document-query similarity is reduced to

$$d_r \cdot q = \sum_{i=1}^{n} a_{i,r}\, q_i.$$

An  example  of  the  basic  model  is  provided  in  Appendix  A. 
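The ranking step in the basic model can be illustrated with a minimal Python sketch. It assumes orthogonal term vectors, so the document-query similarity is the plain inner product of term weights; the three documents, the query and the function names are hypothetical and are not taken from Appendix A.

```python
def inner_product(a, q):
    """Inner product of a document's term-weight vector and a query vector."""
    return sum(a_i * q_i for a_i, q_i in zip(a, q))


def rank_documents(docs, q):
    """Return document indices sorted by decreasing similarity to the query."""
    scores = [inner_product(a, q) for a in docs]
    return sorted(range(len(docs)), key=lambda r: scores[r], reverse=True)


# Hypothetical collection: k = 3 documents over n = 4 index terms.
docs = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 1, 1],
]
query = [1, 0, 1, 0]

ranking = rank_documents(docs, query)  # most similar document first
```

In this toy collection the first document shares both query terms with the highest weights, so it is ranked first, while the second document shares no terms with the query and is ranked last.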

According to Salton (1989), document-document similarity can be measured using the same concept as document-query similarity. The similarity between two documents is given by

$$d_r \cdot d_s = \sum_{i,j=1}^{n} a_{i,r}\, a_{j,s}\, t_i \cdot t_j$$

which reduces to

$$d_r \cdot d_s = \sum_{i=1}^{n} a_{i,r}\, a_{i,s}$$

when the term vectors are uncorrelated, hence orthogonal. Computing document-document similarity is useful for determining how to organize the documents in the document space.
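As a small illustration, the document-document similarities for a whole collection can be computed from the term by document matrix. The sketch below assumes orthogonal term vectors and stores D as a list of n rows, one per index term; the matrix values and the function name are hypothetical.

```python
def doc_doc_similarity(D):
    """Given an n x k term-by-document matrix D (list of n rows),
    return the k x k matrix whose (r, s) entry is sum_i D[i][r] * D[i][s]."""
    n, k = len(D), len(D[0])
    return [
        [sum(D[i][r] * D[i][s] for i in range(n)) for s in range(k)]
        for r in range(k)
    ]


# Hypothetical 4 x 3 matrix: each column is one document vector.
D = [
    [2, 0, 1],
    [0, 3, 1],
    [1, 0, 1],
    [0, 1, 1],
]

S = doc_doc_similarity(D)
```

The result is symmetric, and its off-diagonal entries are the pairwise similarities that would be used when organizing the document space.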

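Table 2. Measures of query-document similarity

Measure                       Similarity between $d_r$ and q
Inner product                 $\sum_{i=1}^{n} a_{i,r} q_i$
Cosine                        $\frac{\sum_{i=1}^{n} a_{i,r} q_i}{\sqrt{\sum_{i=1}^{n} a_{i,r}^2 \, \sum_{i=1}^{n} q_i^2}}$
Pseudo-cosine                 $\frac{\sum_{i=1}^{n} a_{i,r} q_i}{\sum_{i=1}^{n} a_{i,r} \, \sum_{i=1}^{n} q_i}$
Dice                          $\frac{2 \sum_{i=1}^{n} a_{i,r} q_i}{\sum_{i=1}^{n} a_{i,r} + \sum_{i=1}^{n} q_i}$
Covariance                    $\sum_{i=1}^{n} (q_i - \bar{q})(a_{i,r} - \bar{d}_r)$
Product-moment correlation    $\frac{\sum_{i=1}^{n} (q_i - \bar{q})(a_{i,r} - \bar{d}_r)}{\sqrt{\sum_{i=1}^{n} (q_i - \bar{q})^2 \, \sum_{i=1}^{n} (a_{i,r} - \bar{d}_r)^2}}$
Jaccard coefficient           $\frac{\sum_{i=1}^{n} a_{i,r} q_i}{\sum_{i=1}^{n} q_i + \sum_{i=1}^{n} a_{i,r} - \sum_{i=1}^{n} a_{i,r} q_i}$
Overlap                       $\frac{\sum_{i=1}^{n} \min(q_i, a_{i,r})}{\min\left(\sum_{i=1}^{n} q_i, \sum_{i=1}^{n} a_{i,r}\right)}$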

As previously mentioned, several methods of measuring the similarity between a document $d_r$ and a query have been suggested. A list of these measures is provided in Table 2 under the assumption that the term vectors are orthogonal, where $\bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i$ and $\bar{d}_r = \frac{1}{n}\sum_{i=1}^{n} a_{i,r}$. The cosine, Dice, Jaccard, pseudo-cosine, and product-moment correlation measures are all normalized to fall within the range [0, 1] for nonnegative vector elements.

An interpretation of several of the measures is given in Jones and Furnas (1987). The inner product measures similarity as a simple weighted sum. When the term weights in both the document and query vector are binary, the inner product measure is the cardinality of the intersection of the two vectors q and $d_r$. The cosine measure is the inner product measure with each of the document and query vectors normalized by their Euclidean or $\ell_2$ lengths. The normalization provides that the cosine measure ranges from 0 (no matching terms) to 1 (perfect match between the vectors). The pseudo-cosine measure is the inner product measure with the document and query vectors normalized by the $\ell_1$ or city-block lengths. The Dice measure is twice the inner product measure divided by the sum of the $\ell_1$ lengths of the vectors. The covariance measure is the inner product measure of two new vectors $q' = (q - \bar{q})$ and $d_r' = (d_r - \bar{d}_r)$, where $q'$ is the average term weight in the query vector, $\bar{q}$, subtracted from q and $d_r'$ is the average term weight in the document vector, $\bar{d}_r$, subtracted from $d_r$. The product-moment correlation measure is the cosine measure of the two new vectors $q'$ and $d_r'$. The overlap measure, like the cosine measure, ranges from 0 to 1. The numerator of the overlap measure sums, for each component of the query and document vectors, the minimum term weight values. When the vectors are binary, the numerator of the overlap measure and the inner product measure are the same. The denominator is the minimum of the $\ell_1$ lengths of the two vectors. When the term weights are binary, the Jaccard measure can simply be thought of as the size of the intersection of the query and document vectors divided by the size of the union of the two vectors.
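The verbal descriptions of the measures can be made concrete with a short Python sketch. It assumes orthogonal term vectors, nonnegative weights and nonzero vectors, and it uses the $\ell_1$ forms of the pseudo-cosine, Dice, Jaccard and overlap measures as described in the text; the function names and the binary example vectors are hypothetical.

```python
import math


def inner(d, q):
    return sum(a * b for a, b in zip(d, q))


def cosine(d, q):
    # Inner product normalized by the Euclidean lengths.
    return inner(d, q) / math.sqrt(inner(d, d) * inner(q, q))


def pseudo_cosine(d, q):
    # Inner product normalized by the city-block lengths.
    return inner(d, q) / (sum(d) * sum(q))


def dice(d, q):
    return 2 * inner(d, q) / (sum(d) + sum(q))


def jaccard(d, q):
    return inner(d, q) / (sum(d) + sum(q) - inner(d, q))


def overlap(d, q):
    return sum(min(a, b) for a, b in zip(d, q)) / min(sum(d), sum(q))


def centered(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def covariance(d, q):
    # Inner product of the mean-centered vectors.
    return inner(centered(d), centered(q))


def correlation(d, q):
    # Cosine of the mean-centered vectors.
    return cosine(centered(d), centered(q))


# Binary example: the document contains terms 1, 2 and 4; the query terms 1 and 4.
d = [1, 1, 0, 1]
q = [1, 0, 0, 1]
```

For these binary vectors the inner product is 2 (the size of the intersection), the Jaccard measure is 2/3 (intersection over union) and the overlap measure is 1, since every query term appears in the document.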

Wang  et  al.  (1992)  show  that  when  several  of  these  measures  are  normalized  they 

are  linear  by  the  general  definition  provided  in  their  paper.  The  definition  is  given  below. 

Let $m(q,d)$ be a similarity measure on $\Re^n \times \Re^n$. If there exists two functions:

$$N_q : \Re^n \to \Re^n, \quad q \mapsto \tilde{q} = N_q(q)$$

and

$$N_d : \Re^n \to \Re^n, \quad d \mapsto \tilde{d} = N_d(d)$$

such that

$$m(q,d) = N_q(q) \cdot N_d(d) = \tilde{q} \cdot \tilde{d},$$

then we say $m(q,d)$ is a linear measure. The function $N_q$ can be regarded as a normalization of the query vector and the function $N_d$ as a normalization of the document vectors. (Wang et al. 1992, p. 154)
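To see what the definition asks for, consider the cosine measure. A minimal sketch, assuming that both normalization functions simply scale a vector to unit Euclidean length, shows that the cosine can be written as an inner product of normalized vectors; the function names and example vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize_l2(v):
    """Plays the role of both normalization functions for the cosine measure."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_direct(q, d):
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))


q = [1.0, 0.0, 2.0]
d = [2.0, 1.0, 2.0]

# Linear form: inner product of the normalized query and document vectors.
linear_form = dot(normalize_l2(q), normalize_l2(d))
```

Both routes give the same value, which is the sense in which the cosine measure is linear under the definition.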

Using the above definition, Wang et al. (1992) demonstrate that the inner-product, cosine, pseudo-cosine, covariance and product-moment correlation measures are linear. The Dice measure is shown to be linear under the special case that the query vector $q = (q_1, \ldots, q_n)$ is normalized with the $\ell_1$ norm, i.e., $|q| = \sum_{i=1}^{n} q_i = 1$. The authors found the necessary and sufficient conditions under which the linear measures of similarity between documents and queries would produce an acceptable ranking of the documents. "The important point is that this result establishes the basis for adopting these similarity measures in information retrieval." (Wang et al. 1992, p. 159)

Indexing algorithms are used to determine the set of index terms in the vector space model. Indexing can be very simple or quite complex. In Salton (1989) several methodologies for indexing are provided. The indexing methodologies involve several steps, including a step to compute the weight of term $t_j$ for document $d_i$. The term weighting schemes involve computing term frequencies, inverse document frequencies, term discrimination values and term-relevance weights.

The term frequency, $tf_{ij}$, is the frequency of word stem $t_j$ in document $d_i$. The document frequency, $df_j$, is the number of documents in the collection that contain term $t_j$. The inverse document frequency (idf) is given by $\log(k / df_j)$, where k is the number of documents in the collection.
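These three quantities are straightforward to compute once each document is reduced to a list of word stems. The sketch below is illustrative only: the stems are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not specified in the text.

```python
import math
from collections import Counter


def term_frequencies(doc_stems):
    """tf: number of times each stem occurs in one document."""
    return Counter(doc_stems)


def document_frequencies(collection):
    """df: number of documents in the collection that contain each stem."""
    df = Counter()
    for doc in collection:
        df.update(set(doc))  # count each stem at most once per document
    return df


def inverse_document_frequencies(collection):
    """idf: log(k / df) for each stem, with k the number of documents."""
    k = len(collection)
    return {t: math.log(k / n) for t, n in document_frequencies(collection).items()}


# Hypothetical collection of k = 3 already-stemmed documents.
collection = [
    ["stock", "price", "stock"],
    ["price", "merger"],
    ["stock", "earnings"],
]

tf_first = term_frequencies(collection[0])
df = document_frequencies(collection)
idf = inverse_document_frequencies(collection)
```

A stem that appears in every document would receive an idf of zero, which is why frequent, non-discriminating terms contribute little under this weighting.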

In Salton (1989), the term discrimination value, $dv_j$, for term $t_j$ is given by the difference between the space density of the document collection without the term, Q, and the space density with the term, $Q_j$.

$$dv_j = Q - Q_j$$

The space density of a document collection is "... the average pairwise similarity between all pairs of distinct items." (Salton 1989, p. 282)

$$Q = \frac{1}{k(k-1)} \sum_{i=1}^{k} \sum_{\substack{l=1 \\ l \neq i}}^{k} \mathrm{sim}(d_i, d_l) \qquad (1)$$

The higher the average pairwise similarity between the distinct documents the denser the document space. The addition of a term $t_j$ with good discriminatory power will result in a document space that is less dense. Hence, terms that are good discriminators have positive term discrimination values. Computing the space density of the document collection with (1) is quite expensive computationally. A more efficient computation involves defining a document centroid, $C = (c_1, c_2, \ldots, c_n)$, at the center of the document space. Element j of the centroid is defined as the average value of the jth terms in the document collection, $d_{ij}$, and is given by

$$c_j = \frac{1}{k} \sum_{i=1}^{k} d_{ij}.$$

The space density is the average similarity between each document in the collection and the centroid.

$$Q = \frac{1}{k} \sum_{i=1}^{k} \mathrm{sim}(C, d_i)$$

This measurement involves calculating k similarities as opposed to k(k - 1) similarities in (1).
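The centroid-based computation can be sketched in a few lines of Python. The choice of the cosine as the similarity function is an assumption made for illustration, as are the two-document collection and the function names; each document is a list of term weights.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den if den else 0.0


def centroid(docs):
    k = len(docs)
    return [sum(doc[j] for doc in docs) / k for j in range(len(docs[0]))]


def space_density(docs):
    """Average similarity between each document and the centroid."""
    c = centroid(docs)
    return sum(cosine(c, d) for d in docs) / len(docs)


def discrimination_value(docs, j):
    """Density without term j minus density with term j."""
    without_j = [d[:j] + d[j + 1:] for d in docs]
    return space_density(without_j) - space_density(docs)


# Hypothetical collection: term 0 occurs in both documents,
# terms 1 and 2 each occur in only one document.
docs = [
    [1, 1, 0],
    [1, 0, 1],
]

dv_common = discrimination_value(docs, 0)  # term shared by every document
dv_rare = discrimination_value(docs, 1)    # term found in one document only
```

In this example the shared term has a negative discrimination value, because removing it spreads the documents apart, while the term found in a single document has a positive value.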

According to Salton (1989), the retrieval value of each term $t_j$ in a document is given by

$$tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}$$

and is called the term-relevance weight. The term-relevance weight is determined by the probability of occurrence of term $t_j$ in $\pi_1$ or $\pi_2$. Let $r_j$ be the number of documents in $\pi_1$ that contain the term $t_j$. Assuming that there are $k_1$ ($k_2$) documents in $\pi_1$ ($\pi_2$), then

$$p_j = \frac{r_j}{k_1} \quad \text{and} \quad q_j = \frac{df_j - r_j}{k_2}$$

where $p_j$ is the probability of term $t_j$ occurring in $\pi_1$ and $q_j$ is the probability of term $t_j$ occurring in $\pi_2$.

A single-term indexing algorithm follows that uses the term frequency multiplied by the inverse document frequency, the term discrimination value or the term-relevance weight as elements of the document vector.

1. After cleaning the documents, list every word in the document collection.

2. Run a suffix-stripping routine to convert all words to their word stems.

3. Sort the list of word stems and write as a vector v.

4. For every word stem, $t_j$, remaining in the document $d_i$, compute a weighting factor, $w_{ij}$, the weight of term $t_j$ in document $d_i$.

a) The weighting factor can be computed by $w_{ij} = tf_{ij} \cdot \log(k / df_j)$, which is composed of the term frequency and inverse document frequency factor.

b) The weighting factor can be computed by $w_{ij} = tf_{ij} \cdot dv_j$, which is composed of the term frequency and the discrimination value of term $t_j$ for document $d_i$.

c) The weighting factor can be computed by $w_{ij} = tf_{ij} \cdot tr_j$, which is composed of the term frequency and the term-relevance weight of term $t_j$ for document $d_i$.

5. Eliminate all word stems in the vector v with $w_{ij} < q$, where q is a chosen threshold value.

6. For each document $d_i$ in the collection, construct the term weight vector $a_i$ as the weights of each word stem in the place where the stem occurs in v and a zero in all other places.
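The steps above can be sketched in Python using weighting option (a). This is an illustration under several assumptions rather than the procedure used in this research: `crude_stem` is a hypothetical stand-in for a real suffix-stripping routine, the natural logarithm is assumed, and step 5 is interpreted as dropping a stem when its weight is below the threshold in every document.

```python
import math
from collections import Counter


def crude_stem(word):
    """Hypothetical stand-in for a real suffix-stripping routine."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_collection(documents, threshold):
    # Steps 1-2: list the words of each document and reduce them to stems.
    stemmed = [[crude_stem(w) for w in doc.lower().split()] for doc in documents]
    k = len(stemmed)
    # Step 3: sorted list of distinct stems.
    v = sorted({s for doc in stemmed for s in doc})
    # Step 4a: weight = term frequency * log(k / document frequency).
    df = Counter(s for doc in stemmed for s in set(doc))
    weights = [
        {s: tf * math.log(k / df[s]) for s, tf in Counter(doc).items()}
        for doc in stemmed
    ]
    # Step 5 (assumed interpretation): keep a stem only if its weight
    # reaches the threshold in at least one document.
    v = [s for s in v if any(w.get(s, 0.0) >= threshold for w in weights)]
    # Step 6: one weight vector per document, zero where the stem is absent.
    vectors = [[w.get(s, 0.0) for s in v] for w in weights]
    return v, vectors


documents = ["stocks rising", "stocks falling", "merger talks"]
v, vectors = index_collection(documents, threshold=0.5)
```

With this threshold the stem "stock", which occurs in two of the three documents, is eliminated, and the remaining stems each identify a single document.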

One of the problems inherent in the basic VSM is that there are no guidelines for choosing the similarity measure, and that choice is left to the user (Salton 1989). Wong et al. (1988), Bollmann and Wong (1987) and Wong and Yao (1990) address this problem by developing ranking strategies based on user preference instead of the idea of relevance of documents. Two ranking strategies, called the perfect and acceptable rankings, are developed. These ranking strategies are found to have a great impact on information retrieval strategies. Wang et al. (1992) develop the necessary and sufficient conditions for linear similarity measures used for ranking documents to produce an acceptable ranking strategy. The authors also examine the geometric properties of the various linear similarity measures and analyze the structure of the solution vectors. The work in Wang et al. (1992) extends the work in Wong et al. (1988), Bollmann and Wong (1987) and Wong and Yao (1990).

As with any model, measures are needed to determine the performance of the model. Recall and precision are measures that are often used to determine the performance of information retrieval methods. Documents can be partitioned in a document collection into the set of documents relevant to the query, $A$, and the set not relevant to the query, $\bar{A}$ (Salton 1968). Additionally, documents can be partitioned into the set retrieved by the system according to the query, $B$, and the set not retrieved, $\bar{B}$. Recall is defined as the ratio of the number of documents retrieved that are relevant to the user's query to the total number of documents that are relevant in the document collection. Assume $|A|$ is a count of the number of members in set $A$. Recall is given by

$$\text{Recall} = \frac{|A \cap B|}{|A|}.$$

Precision is the ratio of the number of documents retrieved that are relevant to the user's query to the total number of documents retrieved. Precision is given by

$$\text{Precision} = \frac{|A \cap B|}{|B|}.$$

There exists a tradeoff between the level of recall achieved and the level of precision.
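Both measures follow directly from the two sets. A minimal sketch, assuming the relevant and retrieved documents are given as nonempty sets of hypothetical document identifiers:

```python
def recall_precision(relevant, retrieved):
    """Return (recall, precision) for nonempty sets of document identifiers."""
    hits = len(relevant & retrieved)  # retrieved documents that are relevant
    return hits / len(relevant), hits / len(retrieved)


A = {1, 2, 3, 4}  # documents relevant to the query
B = {3, 4, 5}     # documents retrieved by the system

recall, precision = recall_precision(A, B)
```

Here two of the four relevant documents are retrieved, and two of the three retrieved documents are relevant. Retrieving more documents can only keep or raise recall, but it typically lowers precision, which is the tradeoff noted above.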

Salton  et  al.  (1975)  describe  using  the  vector  space  model  to  classify  documents 
based  on  their  similarity  of  terms.  They  determine  how  to  best  configure  the  document 
space  through  automatic  indexing.  Ideally,  documents  that  are  similar  in  index  terms 
according  to  some  similarity  measure  will  be  close  spatially  and  documents  deemed 
dissimilar  are  not  spatially  close  to  one  another.  Their  paper  examines  the  relationship 
between  the  density  of  the  document  space  and  the  performance  of  indexing  techniques. 

The paper by Salton et al. (1975) describes two methods of configuring the document space. The first involves clustering documents based on whether a given set of documents is often used simultaneously in response to a user's query. If one document in a cluster is deemed similar to a particular query, the "neighbors" in the cluster should also be returned in response to the query. The problem with the clustering method of document space configuration is that the retrieval history for the document collection must be known in order to ascertain the user's input about the relevance of the documents in relation to the query. The next best document indexing method suggested by Salton et al. (1975) is to maximize the space between the set of documents. This is achieved by minimizing the sum of all similarity measures between distinct documents over the entire set of documents. The problem with this approach is that the order of complexity of the solution is $n^2$. Although this is polynomial, n is usually very large and, hence, a clustered document space is used in all further studies on the document space and indexing.

Salton  et  al.  (1975)  examine  the  correlation  between  performance  of  the  indexing 
technique  and  the  density  of  the  document  space.  Based  on  their  study,  lower  document 
space  density  seems  to  result  in  better  recall-precision  performance.  Continuing  the 
exploration  of  the  relationship  between  density  and  performance,  Salton  et  al.  (1975) 
examine  the  effects  of  changing  the  document  space  to  determine  whether  the  changes 
cause  differences  in  the  recall-precision  performance.  This  is  done  by  increasing  the 
similarity  within  clusters  and  decreasing  the  between  cluster  similarity.  The  results  again 
indicate  that  dense  document  spaces  correspond  to  lower  recall  and  precision  levels. 

Salton  et  al.  (1975)  also  discuss  a  discrimination  value  model  (DVM)  first 
introduced  by  Salton  and  Yang  (1973)  and  Salton  (1975).  The  value  of  an  index  term  is 
based  on  its  ability  to  discriminate  among  documents  by  increasing  the  difference  among 
document  vectors  when  the  term  is  assigned  as  an  index  term  in  the  document  collection 
(Salton  and  Yang  1973,  Salton  1975). 

The SMART retrieval system is a document retrieval system used as an experimental tool for evaluating various search procedures (Salton 1971). The system was designed at Harvard between 1961 and 1964 and was fully implemented at both Harvard and Cornell. The SMART system at Cornell consists of five sections: (1) text input, (2) clustering or grouping documents, (3) selection of document groups for search, (4) searching and (5) evaluation (Williamson et al. 1971). A simplified SMART system flowchart is provided in Salton and McGill (1983, p. 129). Experiments with the SMART system have been done on several document collections: IRE-3, abstracts in computer literature; CRAN-1, a collection of documents in aerodynamics; Ispra, documents in documentation; ADI, short papers in documentation; and Medlars, a collection of documents in medicine (Salton 1971).

The  results  of  the  experiments  are  summarized  by  Salton  (1971)  according  to 
document  length,  term  weights  and  matching  functions,  word  normalization,  dictionaries 
with  synonym  recognition,  phrase  generation  methods,  hierarchical  procedures,  fully 
automatic versus manual text processing, feedback procedures and partial cluster
searches.  The  longer  the  length  of  the  text  examined  to  match  queries  to  documents,  the 
better  the  retrieval  performance.  However,  the  improvement  in  performance  is  not 
enough  to  conclude  that  a  full  text  search  is  always  warranted  over  searching  abstracts 
only.  Weighted  terms  are  more  effective  for  describing  the  content  of  a  document  than 
terms  without  weights.  The  cosine  measure  of  similarity  between  a  document,  d, ,  and  a 
query,  q ,  is  more  useful  than  an  overlap  function.  Further  improvements  can  be  made  by 
using  more  complex  measures  of  similarity  and  synonym  recognition.  Word 
normalization  is  most  useful  when  documents  contain  non-technical,  redundant 
vocabulary.  Phrase  generation  methods  and  hierarchical  procedures  are  not  of  sufficient 
use,  in  general,  to  merit  implementing  them  in  automatic  systems.  User  feedback 
procedures  substantially  improve  performance  of  subsequent  iterations  of  search.  Search 
time  can  be  reduced  greatly  by  using  partial  cluster  searches  of  selected  groupings  of 
documents  as  opposed  to  a  complete  match  of  the  query  with  all  documents  stored. 
Manual-indexing methods were not found to be substantially superior to fully automatic text processing. Finally, Salton (1971) ranks the tested procedures on their order of merit as follows, with the most effective listed first:

(1) abstract processing with phrase and synonym recognition, (2) weighted
word stem matching and statistical word associations using abstracts for
analysis purposes, (3) logical word stem matching disregarding term
weights and (4) title processing using only document titles for analysis
purposes, and document-request matching based on overlap function.
(Salton 1967, p. 6)

According  to  Raghavan  and  Wong  (1986),  in  order  to  use  the  VSM,  the  various 
components  of  the  model  must  be  specified.  The  dimension  of  the  space  is  assumed  to  be 
n ,  the  number  of  distinct  terms.  This  need  not  be  the  case.  The  term-term  correlations 
need  to  be  provided.  The  simplifying  assumption  in  the  literature  is  that  the  term  vectors 
are  pairwise  orthogonal.  Additionally,  an  interpretation  of  the  elements  of  D  needs  to  be 
given.  Typically,  the  elements  of  D  are  interpreted  as  the  components  of  document 
vectors  along  the  direction  of  the  term  vector  (Salton  1989). 

Raghavan  and  Wong  (1986)  present  the  vector  space  model  relaxing  the 
assumption of term vectors being pairwise orthogonal. In this case, they illustrate the
difference  between  components  of  documents  along  terms  and  projections  of  documents 
along  terms.  They  also  discuss  the  standard  vector  space  model  where  term  vectors  are 
pairwise  orthogonal.  They  illustrate  that  in  this  model  the  elements  of  D  are  both 
projections  and  components  of  documents  along  term  vectors.  The  authors  discuss  the 
dual  of  the  standard  vector  space  model,  where  the  elements  of  D  are  interpreted  as 
components  of  terms  along  documents,  as  opposed  to  documents  along  terms.  A  case  is 
made  for  the  use  of  negative  term  correlations  in  the  model,  where  a  negative  correlation 
between  terms  would  indicate  the  degree  to  which  the  terms  are  "opposite". 
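When the orthogonality assumption is relaxed, the document-query similarity keeps its full double sum over term pairs. The sketch below illustrates this with a term-term correlation matrix G whose entry G[i][j] stands for the correlation between terms i and j; the matrices and vectors are hypothetical values chosen for illustration and are not taken from Raghavan and Wong (1986).

```python
def correlated_similarity(a, q, G):
    """Sum over all term pairs (i, j) of a[i] * q[j] * G[i][j]."""
    n = len(a)
    return sum(a[i] * q[j] * G[i][j] for i in range(n) for j in range(n))


a = [1, 0]  # document contains only term 0
q = [0, 1]  # query contains only term 1

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # basic model: uncorrelated terms
related = [[1.0, 0.5], [0.5, 1.0]]      # positively correlated terms
opposite = [[1.0, -0.5], [-0.5, 1.0]]   # negatively correlated terms

sim_basic = correlated_similarity(a, q, orthogonal)
sim_related = correlated_similarity(a, q, related)
sim_opposite = correlated_similarity(a, q, opposite)
```

Under the basic model the document and query share no terms and the similarity is zero. A positive correlation between the two terms produces a positive similarity, and a negative correlation produces a negative one, which matches the reading of negative correlations as terms that are "opposite".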

Raghavan  and  Wong  (1986)  present  several  shortcomings  and  misconceptions 
involving  the  VSM  to  date  along  with  clarifications  in  the  interpretation  of  the 
components of the VSM. One of the points made is that the linear dependence of term vectors, orthogonality and dimensionality of the vector space are concepts that are closely related. If all of the term vectors are uncorrelated, meaning the terms are pairwise orthogonal, then the set of term vectors is linearly independent. However, the converse is not necessarily true. Linear independence of the term vectors does not imply uncorrelated terms. "Rather, linear independence only implies that any redundancy in the usage of terms has been removed and the representation in terms of the resulting set of vectors is compact (and unique)." (Raghavan and Wong 1986, p. 286) In much of the earlier literature the distinction between linear independence and non-orthogonality is not present.

Vector  space  representations  that  use  linearly  dependent  term  vectors  may  have 
terms  that  can  be  removed  without  losing  information  (Raghavan  and  Wong  1986).  This 
idea  stems  from  the  linear  algebra  concept  that  if  a  vector  space  is  spanned  by  a  linearly 
dependent  set  of  vectors,  then  the  vector  space  can  be  spanned  by  a  linearly  independent 
subset  of  the  vectors. 

Raghavan  and  Wong  (1986)  demonstrate  that  choosing  to  interpret  the  elements 
of  D  as  ". . .  both  the  component  of  documents  along  term  vectors  as  well  as  the 
components  of  terms  along  document  vectors  is  inconsistent."  (Raghavan  and  Wong 
1986,  p.  287)  To  illustrate  why  this  interpretation  is  inconsistent,  the  authors  discuss  the 
standard  VSM,  where  D  is  interpreted  as  the  component  of  documents  along  term 
vectors,  D  =  A ,  compared  to  the  dual  of  the  standard  VSM,  where  D  is  interpreted  as 
components  of  terms  along  document  vectors,  D  =  B' .  The  problem  that  arises  when 
both  interpretations  are  used,  as  is  sometimes  done  in  the  literature  according  to  the 
authors, is that A = B'. This is only true in the special case where both the term-term correlation matrix and the document-document correlation matrix are identity matrices. As the authors point out, this case is uninteresting.

Several papers have been written to expand the vector space model in the face of several shortcomings presented by the basic model. In Wang et al. (1992), necessary and sufficient conditions are identified for ranking documents under the various similarity measures. In order to relax the assumption of pairwise orthogonality of the term vectors, correlation between term vectors must be known (Wong et al. 1987, Raghavan and Wong 1986). Crouch, Crouch and Nareddy (1990) provide a method of automatically constructing an extended query containing multiple concept classes based on a query issued with only a single term. Kleinberg and Tomkins (1999) discuss how to deal with the problem of the high dimensionality inherent in the VSM by a dimension reduction technique, Latent Semantic Indexing. The idea of Latent Semantic Indexing was originally introduced in Deerwester et al. (1990). Additionally, Kleinberg and Tomkins contrast this technique with other approaches to the high dimensionality problem, including clustering and vector space dimension-reduction. They also provide a discussion on documents on the World Wide Web, where there is a link-structure present. Henrich (1996) discusses spatial access methods and their application to the document vectors in the VSM. Henrich's paper (1996) also has to deal with the problem of dimensionality. Spatial access methods, such as nearest neighbor and distance scan queries, assume a small dimension, while the VSM typically involves very high dimensionality.

Wong et al. (1987) developed a generalized vector space model (GVSM). The method of computing term-term similarity is based on term co-occurrence using three ideas. First, a term or concept is defined by the documents that contain that concept. Second, a term is unrelated to another term if the set of documents containing the term does not intersect with the set containing the other term. Finally, the more documents in the intersection between two terms, the greater the similarity measure between them. The GVSM is presented and theoretically justified based on the previously stated ideas behind measuring term-term similarity.


Table 3. Variables for two-group linear discriminant analysis

Variable          Description
p                 Number of attributes
x                 Column vector p x 1 of attributes
N                 Number of observations, $N = N_1 + N_2$
$N_i$             Nonzero number of observations from group i, i = 1, 2
$X_i$             $N_i \times p$ matrix whose jth row is the transpose of the jth observation vector from group i, i = 1, 2, j = 1, ..., $N_i$
$\pi_i$           Group/class i, i = 1, 2
$z_i$             Scalar cutting score for group i classification, i = 1, 2
w                 Column vector p x 1 of coefficients determined by discriminant analysis
$\mathbf{1}_i$    Column vector $N_i \times 1$ of ones, i = 1, 2

2.3  Discriminant  Analysis 

Discriminant analysis is a technique used to separate observations into classes or to assign new observations to already existing classes based on a discriminant function or discriminant score (Johnson and Wichern 1982). In its linear form, discriminant analysis separates observations via one or more hyperplanes. The general objective of discriminant analysis is to find a hyperplane that best separates the observations. The method for finding the hyperplane, the specific objective, the criteria used to find cutting score z and the underlying assumptions are different for each variation of the discriminant analysis methods. "All of the models determine a linear discriminant function (LDF) by optimizing some criterion that is invariably a surrogate for minimizing the number of misclassifications." (Koehler and Erenguc 1990, p. 63) An observation x is misclassified if the discriminant function places x in the wrong group (Koehler and Erenguc 1990).

First,  a  discussion  of  the  notation  used  in  this  section  is  provided  in  Section  2.3.1. 
Second,  a  description  of  Fisher's  (1936)  linear  discriminant  function  is  given  in  Section 
2.3.2.  Finally,  several  linear  programming  variations  are  presented  in  Section  2.3.3. 

2.3.1  Notation 

For all variations of two-group discriminant analysis, the discriminant function is found based on observations for which the group classification is known. This group of observations is called the training sample (Koehler and Erenguc 1990). Adopting the notation from Koehler and Erenguc (1990) and Johnson and Wichern (1982), Table 3 describes the variables from the observations in the training sample. A column vector of observations x is assigned to $\pi_1$ if $w'x > z_1$ and to $\pi_2$ if $w'x < z_2$. In most variations $z_1$ and $z_2$ are equal. In the case where they are equal, observations that lie on the hyperplane are arbitrarily classified into $\pi_1$ or $\pi_2$.
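The classification rule can be written as a short function. The sketch assumes the rule assigns x to group 1 when the discriminant score w'x exceeds the first cutting score and to group 2 when it falls below the second; returning `None` for an observation on the hyperplane is a choice made here so that the caller can apply an arbitrary tie-breaking rule. The coefficient values are hypothetical.

```python
def classify(w, x, z1, z2=None):
    """Return 1 or 2 for the assigned group, or None if the rule does not decide."""
    if z2 is None:
        z2 = z1  # most variations use a single cutting score
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))
    if score > z1:
        return 1
    if score < z2:
        return 2
    return None  # on the hyperplane; classification is left to the caller


w = [1.0, -1.0]  # hypothetical coefficient vector
z = 0.0          # hypothetical cutting score

group_a = classify(w, [2.0, 1.0], z)
group_b = classify(w, [1.0, 2.0], z)
on_plane = classify(w, [1.0, 1.0], z)
```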

2.3.2  Fisher's  Linear  Discriminant  Function 

Fisher  originally  proposed  discriminant  analysis  in  a  paper  in  1936  (Fisher  1936). 
The assumptions in Fisher's (1936) linear discriminant analysis method are (1) the data
are multivariate normally distributed and (2) the variance-covariance matrices of the two
groups of data are known and equal. Fisher (1936) chose the linear combination
of $x$ that maximizes the ratio of the squared distance between the means to the variance in the
$Y$'s, given by

$$\frac{(\mu_{1Y} - \mu_{2Y})^2}{\sigma_Y^2}$$

where $\mu_{1Y}$ is the mean of the $Y$'s obtained from $X$'s belonging to $\pi_1$, $\mu_{2Y}$ is the mean
of the $Y$'s obtained from $X$'s belonging to $\pi_2$ and $\sigma_Y^2$ is the variance in the $Y$'s.

Fisher's linear discriminant function is the linear combination that maximizes the above ratio. The function is
given by

$$y = (\mu_1 - \mu_2)' \Sigma^{-1} x.$$

The covariance matrices are assumed to be equal for both classes of objects. The
covariance matrix is given by

$$\Sigma = E(X - \mu_i)(X - \mu_i)', \quad i = 1, 2.$$

When $\mu_1$, $\mu_2$ and $\Sigma$ are unknown, Fisher's sample linear discriminant function is used.
The function is given by

$$y = (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} x, \qquad S_{pooled} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n_1 + n_2 - 2}.$$

The cutting score $z$ depends on a point between $\bar{y}_1 = b'\bar{x}_1$ and $\bar{y}_2 = b'\bar{x}_2$. If the group
sizes of the two classes are equal, use the midpoint to determine the cutting score $z$. If
the group sizes are unequal, use a weighted average method.
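
As an illustration of the sample discriminant function, the following Python sketch computes the pooled covariance matrix, the coefficient vector and the midpoint cutting score for two equally sized groups with two variables. It is a minimal sketch under stated assumptions: the coefficient vector b is taken to be $S_{pooled}^{-1}(\bar{x}_1 - \bar{x}_2)$, ties at the cutting score are assigned to group 1, and the function names and toy data are hypothetical and do not come from Fisher (1936) or Johnson and Wichern (1982).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of observation vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def sample_covariance(rows):
    """Unbiased sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def pooled_covariance(g1, g2):
    """((n1 - 1) S1 + (n2 - 1) S2) / (n1 + n2 - 2)."""
    n1, n2 = len(g1), len(g2)
    s1, s2 = sample_covariance(g1), sample_covariance(g2)
    p = len(s1)
    return [[((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (n1 + n2 - 2)
             for b in range(p)] for a in range(p)]


def solve_2x2(m, v):
    """Return m^{-1} v for a 2 x 2 matrix m (this sketch handles two variables only)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    return [(m[1][1] * v[0] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - m[1][0] * v[0]) / det]


def fisher_sample_ldf(g1, g2):
    """Return (b, z): coefficient vector and midpoint cutting score."""
    m1, m2 = mean_vector(g1), mean_vector(g2)
    diff = [a - b for a, b in zip(m1, m2)]
    b = solve_2x2(pooled_covariance(g1, g2), diff)
    z = (dot(b, m1) + dot(b, m2)) / 2.0  # midpoint, appropriate for equal group sizes
    return b, z


def classify(b, z, x):
    """Assign x to group 1 if its score reaches the cutting score, else group 2."""
    return 1 if dot(b, x) >= z else 2


# Hypothetical training sample: four observations per group.
group1 = [[4.0, 2.0], [5.0, 4.0], [6.0, 3.0], [5.0, 3.0]]
group2 = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [2.0, 2.0]]
b, z = fisher_sample_ldf(group1, group2)
```

For unequal group sizes the midpoint would be replaced by a weighted average of the two group mean scores, which this sketch does not attempt.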
2.3.3  Linear  Programming  Approach  to  Linear  Discriminant  Analysis 

Mangasarian  (1965)  and  Rosen  (1965)  proposed  mathematical  programming 
methods  for  discriminant  analysis.  Later,  Hand  (1981)  and  Freed  and  Glover  (1981a, 
1981b)  independently  examined  the  linear  programming  approach  to  the  discriminant 
problem in more depth. The advantage of using a linear programming approach is that it
does not require the assumptions underlying Fisher's LDF. Various linear programming
approaches  show  promise,  especially  when  the  assumptions  for  Fisher's  LDF  are  violated 
(Koehler  1989a,  1989b,  Ragsdale  and  Stam  1991).  Freed  and  Glover  (1981a)  proposed 
two  LP  formulations  of  the  discriminant  analysis  problem,  the  minimize  the  maximum 
deviations  (MMD)  model  and  the  minimize  the  sum  of  deviations  (MSD)  model.  Hand 
(1981)  proposed  the  perceptron  model,  a  version  of  an  MSD  model  that  appeared 
independently  and  in  the  same  year  as  the  MSD  model  proposed  by  Freed  and  Glover 
(1981a).  Freed  and  Glover  (1981b)  proposed  another  model  called  the  minimize  the  sum 
of  interior  distances  (MSID)  model. 

The MMD formulation is discussed in Section 2.3.3.1. The MSD formulation is
discussed  in  Section  2.3.3.2.  In  Section  2.3.3.3  a  discussion  of  problems  that  occur  in  the 
formulations  is  given.  Finally,  in  Section  2.3.3.4  some  alternative  formulations  are 
discussed  to  alleviate  the  problems  discussed  in  Section  2.3.3.3. 
2.3.3.1  The  MMD  formulation 

In the MMD formulation (Freed and Glover 1981a), the objective is to minimize the
maximal violation of misclassified observations. The original MMD formulation (Freed
and Glover 1981a) is

$$
\begin{aligned}
\text{Maximize} \quad & d \\
\text{subject to} \quad & X_1 w - d\,\mathbf{1}_1 \ge c\,\mathbf{1}_1 \\
& X_2 w + d\,\mathbf{1}_2 \le c\,\mathbf{1}_2 \\
& w, d \ \text{unrestricted in sign} \\
& c \ \text{a positive constant}
\end{aligned}
$$

If  d  >  0 ,  then  perfect  group  separation  is  achieved.  If  d  <  0 ,  then  there  is  some  overlap 
in  the  groups,  but  the  overlap  is  minimized.  If  d  =  0 ,  then  the  two  groups  share  the 
points  on  the  hyperplane.  Several  variations  of  the  MMD  formulations  are  provided  in 
Freed and Glover (1986a,b).
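
To make the sign interpretation of d concrete, the following Python sketch evaluates the MMD constraints for a fixed weight vector instead of solving the LP. For given w and c, the largest d that satisfies both constraint sets is the smallest slack over all observations. The function name `largest_feasible_d` and the toy data are illustrative assumptions and are not part of Freed and Glover's formulation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def largest_feasible_d(group1, group2, w, c):
    """Largest d with x'w - d >= c for group 1 and x'w + d <= c for group 2."""
    slack1 = min(dot(x, w) - c for x in group1)  # each group 1 row needs d <= x'w - c
    slack2 = min(c - dot(x, w) for x in group2)  # each group 2 row needs d <= c - x'w
    return min(slack1, slack2)


w, c = [1.0, 0.0], 2.0

# Separable toy data: the result is positive, so the groups are perfectly separated.
separable_g1 = [[3.0, 1.0], [4.0, 2.0]]
separable_g2 = [[1.0, 1.0], [0.0, 2.0]]
d_separable = largest_feasible_d(separable_g1, separable_g2, w, c)

# Adding a group 2 observation on the wrong side makes the result negative (overlap).
overlapping_g2 = separable_g2 + [[3.5, 0.0]]
d_overlap = largest_feasible_d(separable_g1, overlapping_g2, w, c)
```

The LP itself searches over w as well; this sketch only shows how d is determined once w is fixed.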
2.3.3.2  The  MSD  formulation 

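The MSD formulation (Freed and Glover 1981a) minimizes the sum of the
(weighted) violations. One of the original MSD formulations given by Freed and Glover
(1981a) is

$$
\begin{aligned}
\text{Maximize} \quad & \mathbf{1}_1' d_1 + \mathbf{1}_2' d_2 \\
\text{subject to} \quad & X_1 w - d_1 \ge c\,\mathbf{1}_1 \\
& X_2 w + d_2 \le c\,\mathbf{1}_2 \\
& w, d_1, d_2 \ \text{unrestricted in sign} \\
& c \ \text{a positive constant}
\end{aligned}
$$

The original MSD formulation proposed independently by Hand (1981) is

$$
\begin{aligned}
\text{Minimize} \quad & \mathbf{1}_1' d_1 + \mathbf{1}_2' d_2 \\
\text{subject to} \quad & X_1 w + c\,\mathbf{1}_1 + d_1 \ge b_1 \\
& X_2 w + c\,\mathbf{1}_2 - d_2 \le -b_2 \\
& d_1, d_2 \ge 0 \\
& w, c \ \text{unrestricted in sign} \\
& b_1, b_2 \ \text{positive constants}
\end{aligned}
$$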

Other  MSD  formulations  are  given  in  Freed  and  Glover  (1986a)  and  Bajgier  and  Hill 
(1982). 
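
The following Python sketch shows how the deviations in Hand's formulation behave for a fixed weight vector and intercept; it does not solve the LP. For fixed w and c, the smallest nonnegative deviation that satisfies each constraint is a one-sided shortfall, and the objective is their sum. The function name `hand_msd_objective`, the scalar treatment of $b_1$ and $b_2$ and the toy data are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hand_msd_objective(group1, group2, w, intercept, b1=1.0, b2=1.0):
    """Sum of minimal deviations for fixed w and intercept (c in the formulation).

    Group 1 rows need x'w + c + d >= b1, so d = max(0, b1 - (x'w + c)).
    Group 2 rows need x'w + c - d <= -b2, so d = max(0, x'w + c + b2).
    """
    d1 = [max(0.0, b1 - (dot(x, w) + intercept)) for x in group1]
    d2 = [max(0.0, dot(x, w) + intercept + b2) for x in group2]
    return sum(d1) + sum(d2)


# Hypothetical one-variable sample: one observation per group falls inside the margin.
group1 = [[3.0], [0.5]]
group2 = [[-2.0], [0.0]]
total_deviation = hand_msd_objective(group1, group2, w=[1.0], intercept=0.0)
```

An LP solver would choose w and c to make this sum as small as possible; observations that clear their margin contribute nothing.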

2.3.3.3  Discussion  of  problems  with  formulations 

There are problems with some of the LP approaches: unbounded solutions,
improper solutions, unacceptable solutions and instability of solutions under a
linear transformation of the data (Ragsdale and Stam 1991). A solution is said to be
unbounded in LP if the objective function can be increased or decreased without limit.
An improper solution occurs when all observations lie on the classification hyperplane
(Ragsdale and Stam 1991). If a solution "... generates a discriminant function of zeros,
in which case all observations will be classified in the same group ..." (Koehler 1989a,
p. 241), then it is an unacceptable solution. Koehler (1989a) characterizes
unacceptable solutions for the various versions of the MMD, MSD and MSID models.
Some LP versions of discriminant analysis yield different solutions to the LP
problem under linear transformation of the data (Markowski and Markowski 1985).
When this problem occurs, the solution is unstable under linear transformation.

Glorfield  and  Gaither  (1982)  criticize  the  LP  approach  offered  by  Freed  and 
Glover  (1981a).  The  criticisms  involve  the  simplicity  of  the  LP  model,  the  appropriate 
role  of  the  LP  model,  the  adaptability  of  the  LP  model  to  more  than  the  two-group 
problem and the ability of the LP model to compete with existing approaches. Freed and
Glover (1982) answer these criticisms.
2.3.3.4  Alternative  formulations

Several  alternative  LP  formulations  of  the  discriminant  problem  have  been 
offered  in  response  to  the  problems  discussed  in  2.3.3.3.  In  an  attempt  to  avoid 
unacceptable  solutions,  Glover  et  al.  (1988)  suggest  a  hybrid  formulation  called  the 
HDM.  Upon  the  discovery  by  Koehler  (1991)  that  the  formulation  did  not  completely 
solve  the  problem  of  unacceptable  solutions,  Glover  (1990)  refined  the  HDM.  The 
refined  version  does  solve  the  problems  that  plagued  earlier  versions. 

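However, due to the increased complexity of the refined version of the HDM,
Ragsdale and Stam (1991) offer simplified formulations of the earlier MMD and MSD
models based on using a classification gap. The formulation based on the MMD is called
the epsilon MMD or the EMMD and is given by

$$
\begin{aligned}
\text{Minimize} \quad & d \\
\text{subject to} \quad & w_0 \mathbf{1}_1 + X_1 w - d\,\mathbf{1}_1 \le 0 \\
& w_0 \mathbf{1}_2 + X_2 w + d\,\mathbf{1}_2 \ge \varepsilon\,\mathbf{1}_2 \\
& d \ge 0 \\
& w_0, w \ \text{unrestricted in sign}
\end{aligned}
$$

where $w_0$ is the intercept and $\varepsilon = 1$ is the gap between the two hyperplanes found as a
solution. The epsilon MSD (EMSD) formulation developed in Ragsdale and Stam (1991)
is given by

$$
\begin{aligned}
\text{Minimize} \quad & \mathbf{1}_1' d_1 + \mathbf{1}_2' d_2 \\
\text{subject to} \quad & w_0 \mathbf{1}_1 + X_1 w - d_1 \le 0 \\
& w_0 \mathbf{1}_2 + X_2 w + d_2 \ge \varepsilon\,\mathbf{1}_2 \\
& d_1, d_2 \ge 0 \\
& w_0, w \ \text{unrestricted in sign.}
\end{aligned}
$$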

Additionally,  Koehler  and  Erenguc  (1990)  provide  a  mixed  integer  mathematical 
programming  formulation  to  minimize  the  number  of  misclassifications. 

2.4  Financial  Literature 
There  are  several  recent  papers  investigating  the  relationship  between  news  and 
stock  returns.  Chan  (in  press)  studies  the  return  patterns  for  a  set  of  stocks  with  public 
news  releases  (news  stocks)  versus  a  set  of  stocks  with  similar  monthly  returns  without 
news  releases  (no-news  stocks).  He  finds  that  there  is  a  major  difference  in  the  return 
patterns for news versus no-news stocks. Specifically, news stocks with negative returns
underperform their peers, positive news stocks experience less price drift and extreme
return  no-news  stocks  experience  reversal  in  the  subsequent  month  with  little  abnormality 
after  that. 

In a recent study, Daniel and Titman (Daniel, K., & Titman, S. 2001. Market
reactions to tangible and intangible information. Unpublished manuscript.) examine
reactions to tangible and intangible information in the market.

Tangible  information  consists  of  explicit  performance  measures, 
like  sales,  earnings  and  cash  flows,  which  can  be  observed  in  the 
firms'  accounting  statements.  Intangible  information,  in  contrast, 
is  that  part  of  the  stock's  past  return  that  cannot  be  linked  directly 
to  accounting  numbers,  but  which  presumably  reflects  expectations 
about future cash flows. (Daniel, K., & Titman, S. 2001. Market
reactions  to  tangible  and  intangible  information.  Unpublished 
manuscript,  p.  3) 

They find evidence that investors overreact to intangible information, but not to tangible
information.  To  explain  these  results,  Daniel  and  Titman  (Daniel,  K.,  &  Titman,  S.  2001. 
Market  reactions  to  tangible  and  intangible  information.  Unpublished  manuscript.) 
highlight  work  in  psychology  that  suggests  individuals  are  overconfident  when 
interpreting  information  in  settings  where  more  judgment  is  required  and  short-run 
feedback on the value of the judgment is unavailable (Einhorn 1980).

In  a  paper  by  Huberman  and  Regev  (2001),  the  events  surrounding  an  in-depth 
news story in The New York Times on Sunday, May 3, 1998 concerning a breakthrough in
cancer research by Entremed Inc. (ENMD) are examined. After the story, the stock price
of ENMD increased by 330% from Friday-close to Monday-close and the price of stocks
in  the  biotechnology  sector  also  increased  significantly.  Interestingly,  the  scientific 
breakthrough  discussed  in  the  story  had  already  been  published  in  Nature  and  in  other 
sources  in  the  press  five  months  prior  to  the  article  in  the  Times,  with  a  much  milder 
reaction  with  respect  to  stock  price  and  no  spillover  to  the  rest  of  the  biotechnology 
sector. 

2.5  Text  Classification 

The  problem  of  analyzing  the  content  of  text-based  documents  for  classification 
purposes  is  not  new.  The  purpose  of  text  categorization  is  to  classify  each  document  in  a 
document  collection  into  zero  to  multiple  categories  based  on  a  predefined  set  of 
categories.  Many  algorithms  have  been  written  for  the  text  classification  problem 
including  CONSTRUE  (Hayes  &  Weinstein  1990),  DTree  (Lewis  &  Ringuette  1994), 
NaiveBayes (Lewis & Ringuette 1994), SWAP-1 (Apte et al. 1994), NNets (Wiener et al.
1995), Rocchio (Rocchio 1971), k-NN (Hayes & Weinstein 1990) and support vector
machines (SVM) introduced by Vapnik et al. (Vapnik 1995, Cortes & Vapnik 1995).
Many of the algorithms are based on statistical learning methods and some are based on
the VSM (Salton 1968). The process discussed in this dissertation is based on the VSM
(Salton 1968), as is the most popular text categorization algorithm, Rocchio
(Rocchio 1971).

In Rocchio (1971), each document vector is categorized according to a set of
prototype vectors. Each predefined category has a prototype vector constructed based on
a training set of documents. Each document is ranked according to a similarity measure
comparison between the document vector and each prototype category vector.
CONSTRUE (Hayes & Weinstein 1990) is a rule-based expert system used to categorize
Reuters news stories. DTree (Lewis & Ringuette 1994) methods of classification are
based on decision trees. NaiveBayes (Lewis & Ringuette 1994) is a text classification
method based on a naive Bayes model. SWAP-1 (Apte et al. 1994) uses rules with an
inductive learning algorithm for classification. Wiener et al. (1995) use neural networks
(NNets) to solve the text classification problem. Hayes and Weinstein (1990) developed
a k-nearest neighbor (k-NN) approach to solving the classification problem. Yang (1999)
evaluates the aforementioned methods of text categorization on two document
collections: the Reuters corpus, newswire stories collected from 1987 to 1991, and the
OHSUMED corpus developed at the Oregon Health Sciences University by William Hersh
and colleagues. Yang (1999) finds that of the aforementioned methods, k-NN and NNets
were the top performers. Support vector machines were introduced by Vapnik et al.
(Vapnik 1995, Cortes & Vapnik 1995) and are based on statistical learning theory.
Joachims (1998) compares the performance of support vector machines to the Rocchio
algorithm, the naive Bayes classifier, the k-NN classifier and the C4.5 decision-tree rule
learner (Quinlan 1993). Joachims (1998) concludes that support vector machines
significantly outperform the methods to which they were compared.
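
The prototype-based ranking attributed to Rocchio (1971) at the start of this paragraph can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each prototype is taken to be the centroid of its category's training vectors and cosine similarity is used as the similarity measure, neither of which is specified in the text above. The category names and vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def centroid(vectors):
    """Component-wise mean of the training vectors of one category."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def rank_categories(doc, prototypes):
    """Category names ordered from most to least similar to the document."""
    return sorted(prototypes, key=lambda name: cosine(doc, prototypes[name]),
                  reverse=True)


# Hypothetical training vectors over three index terms.
training = {
    "earnings": [[3.0, 0.0, 1.0], [2.0, 0.0, 0.0]],
    "merger": [[0.0, 2.0, 1.0], [0.0, 3.0, 0.0]],
}
prototypes = {name: centroid(vectors) for name, vectors in training.items()}
ranking = rank_categories([1.0, 0.0, 0.0], prototypes)
```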

The  first  problem  encountered  in  text  categorization  is  how  to  represent  the 
documents.  Salton  (1968)  developed  the  VSM  to  represent  documents  and  queries  issued 
by  users  as  vectors.  Using  vectors  to  represent  documents  provides  a  quantitative 
approach  to  the  problems  of  information  retrieval  and  text  categorization.  The  basic  idea 
in  the  VSM  is  to  convert  documents  to  vectors  by  first  converting  each  word  in  the 
document to its word stem and then constructing the document vector by counting the
frequency  of  each  word  stem  in  that  document. 
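
The conversion from a document to a term-frequency vector can be sketched in Python as follows. The sketch is illustrative only: `toy_stem` is a crude suffix stripper standing in for a real stemming algorithm, and the index terms and sample sentence are hypothetical.

```python
from collections import Counter


def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one common suffix if enough remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def document_vector(text, index_terms):
    """Count how often each index term (a word stem) occurs in the text."""
    stems = [toy_stem(token) for token in text.lower().split()]
    counts = Counter(stems)
    return [counts[term] for term in index_terms]


index_terms = ["jump", "stock", "merger"]  # hypothetical index terms
vector = document_vector("Stocks jumped as the stock kept jumping", index_terms)
```

Each position of the resulting vector corresponds to one index term, so documents of any length map to vectors of the same dimension.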

After  a  suitable  document  representation  has  been  decided,  the  next  problem  in 
text  classification  is  the  method  of  categorization.  As  the  VSM  provides  a  linear 
representation  of  documents,  it  is  natural  to  categorize  the  documents  via  linear 
discriminant  analysis  (LDA)  or  natural  generalizations  such  as  support  vector  machines 
(Vapnik  1995,  Cortes  &  Vapnik  1995).  This  is  in  contrast  to  the  categorization  method 
used  in  the  Rocchio  (Lewis  et  al.  1996)  algorithm. 


CHAPTER  3 
RESEARCH  PROBLEM 

As discussed in Section 2.1, environmental scanning is very important to the
success of a corporation. Many studies support the claim that environmental scanning improves an
organization's performance (Miller and Friesen 1977, Newgren et al. 1984, Dollinger
1984, West 1988, Daft et al. 1988, Subramanian et al. 1993, Subramanian et al. 1994,
Murphy 1987, Ptaszynski 1989). Choo summarizes that "... environmental scanning is
increasingly being used to drive the strategic planning processes by business and public
sector organizations in most developed countries." (Choo 1995, p. 101) Additionally,
with the development of the Internet, the amount of information easily available to a
corporation is vast. "Companies of all shapes and sizes are finding that the Internet
provides new opportunities for competitive advantage." (Cronin 1993, p. 40-43)

Information, both external and internal, is a vital part of making strategic
decisions (Pawar and Sharda 1997). However, Pawar and Sharda (1997) warn that
unsystematic information gathering from the Internet wastes time and money. Hence,
one problem becomes acquiring useful information and mining it for what is relevant in a
systematic automated manner. Pawar and Sharda (1997) suggest that an Internet-based
environmental scanning system can offer benefits, but also has costs. The benefits
include the timeliness, low cost and quantity of the information available. However, the
cost of searching can be quite high. Therefore, a second problem is utilizing the
information available via the Internet with minimal cost. The environmental scanning
process in this dissertation is developed with the problems discussed in mind. The
process  of  collecting  documents  can  be  automated  and  be  as  simple  as  running  a  few  Java 
programs  on  a  regular  basis.  As  discussed  below,  we  investigate  methods  to  automate  the 
analysis  of  these  documents  using  information  retrieval  ideas  with  discriminant  analysis. 
These  methods  can  also  become  part  of  the  downloading  programs.  Thus,  the 
information  is  gathered  and  analyzed  systematically  with  minimal  cost. 

3.1  Problem  Setting 
As  scanning  involves  gathering  and  analyzing  information  from  the  environment 
and  the  Internet  provides  a  vast  amount  of  easily  available  html  documents  to  scan,  it  is 
natural  to  develop  a  process  to  scan  html  documents  for  information.  There  are  several 
approaches  that  can  be  employed  to  solve  the  problem  of  analyzing  the  content  of  html 
documents,  including  traditional  data  mining  or  machine  learning  methods.  However,  the 
vector  space  model  (VSM)  was  developed  specifically  to  solve  information  retrieval  (IR) 
problems  and  provides  a  convenient  method  for  quantifying  the  problem  of  content 
analysis. 

Much  of  the  literature  relating  to  the  VSM  involves  improving  retrieval  results 
based  on  a  user's  query  with  respect  to  measures  such  as  precision  and  recall.  However, 
in this dissertation the model has been adapted to represent web documents in a vector
format  in  order  to  analyze  their  content  for  information  about  the  environment.  This 
content  analysis  is  accomplished  by  analyzing  a  set  of  documents,  called  a  training  set. 
Each  document  in  the  training  set  is  classified  according  to  a  categorical  variable  of 
interest.  The  information  gathered  by  the  training  set  is  then  used  to  determine  the  value 
of  the  categorical  variable  for  future  documents.  Therefore,  the  problems  that  remain  are 
how to separate documents based on the categorical variable and how to predict the value
of  the  categorical  variable  for  new  documents. 

As  the  vector  space  model  is  linear,  it  is  natural  to  separate  documents  based  on  a 
categorical  variable  with  linear  methods.  One  popular  linear  method  is  linear 
discriminant  analysis  (LDA).  LDA  is  traditionally  used  for  separation  of  observations 
into  groups  and  classification  of  new  observations.  The  various  LDA  formulations  find  a 
linear  discriminant  function  (LDF)  based  on  a  training  set  with  known  group 
membership.  The  LDF  can  be  used  to  classify  new  observations.  Therefore,  LDA  solves 
the  problems  of  how  to  separate  documents  based  on  a  categorical  variable  and  how  to 
predict  the  value  of  the  categorical  variable  for  new  documents. 

The  VSM  has  been  criticized  for  the  mathematical  inconsistencies  that  exist 
between  the  various  interpretations  that  have  been  used  in  the  literature  as  well  as  its  lack 
of  consistent  interpretation  and  use  (Raghavan  and  Wong  1986).  The  new  adaptation  of 
the  model  only  uses  the  linear  vector  representation  for  documents  and  terms,  without  the 
need  for  the  vector  operations  that  have  been  criticized  for  their  lack  of  theoretical 
justification.  Other  criticisms  are  the  lack  of  guidelines  for  choosing  the  similarity 
measure and the assumed orthogonality of terms, an assumption arising from the extreme
difficulty  of  determining  term-term  correlations  (Salton  1989).  In  our  adaptation,  no 
similarity  measure  is  needed,  as  there  are  no  queries.  The  literature  provides  theoretical 
discussion  of  how  correlated,  non-orthogonal  term  vectors  are  incorporated  into  the  basic 
VSM  (Raghavan  and  Wong  1986,  Salton  1989),  but  there  are  no  guidelines  on  how 
exactly  term-term  correlations  for  correlated,  non-orthogonal  vectors  are  determined.  In 
addition, according to Salton, "... it is not simple to generate useful term associations."
(Salton 1989, p. 314) In spite of the criticism of the orthogonality of terms, there have
been  useful,  interesting  results  (Salton  1971,  Salton  1975,  Salton  and  McGill  1983). 
Therefore,  we  use  the  simplifying  assumption  of  orthogonality  of  terms. 

In  conclusion,  the  process  developed  in  this  dissertation  for  web-based 
environmental  scanning  is  simple,  directed  and  automated.  The  content  analysis  of  the 
set  of  collected  documents  is  based  on  the  well-documented,  frequently  used  VSM 
representation  developed  for  information  retrieval  combined  with  traditional  LDA  for 
separating  the  documents  based  on  a  categorical  variable.  Section  3.2  offers  a  list  of 
research questions based on this process. Section 3.3 describes the application
environment used to empirically analyze our environmental scanning process. Finally,
Section 3.4 provides an in-depth description of the VSM adaptation and methods for
determining the linear discriminant function in our application environment.

3.2  Research  Questions 

Based  on  the  description  of  the  need  for  an  automated  web-based  environmental 
scanning  process  and  the  explanation  of  the  VSM  and  LDA  as  the  tools  used  to  develop 
that  process,  two  general  research  questions  are: 

RQ1: How well does the process classify or group the training set of documents based on
a  categorical  variable? 

RQ2:  Does  the  process  predict  the  correct  classification  better  than  random  guessing? 

3.3  Application  Environment 
The relationship between stock returns and news is a hot topic in the financial
literature, with several very recent papers examining this
relationship (Huberman and Regev 2001, Chan in press, Daniel, K., & Titman, S. 2001.
Market reactions to tangible and intangible information. Unpublished manuscript.).
Additionally, data about publicly traded companies are readily available via the Internet, so
the environmental scanning process developed in this dissertation is empirically tested in
the  stock  market.  Online  news  articles  about  specific  stocks  are  collected  for  simple 
signals,  as  indicated  by  the  terms  used  in  the  articles,  about  future  stock  returns  or 
changes  in  trading  volume.  A  list  of  web  sites  used  to  collect  the  news  articles  is 
provided  in  Appendix  B  Table  18.  The  information  collected  via  the  scanning  process  is 
the  content  of  the  articles.  Each  article  in  the  training  set  is  analyzed  to  determine 
whether  the  article  indicates  an  increase  or  a  decrease  in  stock  return  relative  to  the 
market  or  a  change  in  trading  volume  as  compared  to  the  previous  day. 

3.4  Vector  Space  Representation  and  Discriminant  Analysis 

For  each  stock,  the  k  articles  collected  will  be  used  to  determine  if  the  text  or 
terms  in  the  article  indicate  whether  the  stock's  return  will  increase  or  decrease  relative  to 
the  market  in  the  target  period  following  the  report  or  whether  the  stock's  trading  volume 
will  increase  or  decrease.  Once  the  articles  or  documents  are  collected,  a  set  of  n  index 
terms needs to be determined. Let $t_1, \ldots, t_n$, for $t_i \in \Re^n$, be the vectors corresponding to
the  n  index  terms.  The  term  vectors  form  a  vector  space.  When  the  terms  are  linearly 
independent,  the  dimensionality  of  the  vector  space  is  n .  Table  4  summarizes  the 
notation  for  the  vector  space  representation  of  the  problem. 

With full dimensionality, each document can be written as a linear combination of
term vectors. Articles or documents are represented by vectors, $d_r$. For the collection of
$k$ documents or news articles for each stock, an $n \times k$ document by term matrix $D$ can be
constructed with the document vectors where each column of the document matrix
corresponds to a document vector $d_r$:

$$d_r = \sum_{i=1}^{n} a_{ir} t_i = D e_r = T A e_r, \quad r = 1, \ldots, k.$$

The elements of $a_r$ are the term weights when the terms are orthogonal, which can be
confirmed by $t_j' d_r = a_{jr}$. The $a_{ir}$'s are determined by an indexing operation on the
document collection.


Table 4. Variable definition

Variable    Meaning
$k$         Number of documents
$k_i$       Number of documents in group $i$, $i = 1, 2$; $k = k_1 + k_2$
$n$         Number of terms
$t_i$       $n \times 1$ term vector representing term $i$
$T$         $n \times n$ term matrix where the $t_i$'s are the columns
$d_r$       $n \times 1$ vector for document $r$, $d_r = D e_r$
$D$         $n \times k$ matrix where the $d_r$'s are the columns
$D_i$       $n \times k_i$ matrix where the $d_r$'s in group $i$ are the columns, $i = 1, 2$
$A$         $n \times k$ term document matrix where $a_{ir}$ is the weight of term $i$ in document $r$
$a_r$       $n \times 1$ term weight vector for document $r$, $a_r = A e_r$
$G_t$       $n \times n$ term-term correlation matrix where $g_{ij} = t_i \cdot t_j$ is the correlation between terms $t_i$ and $t_j$
$G_d$       $k \times k$ document-document correlation matrix where $g_{rs} = d_r \cdot d_s$ is the correlation between documents $d_r$ and $d_s$
$\pi_i$     Documents in group $i$, $i = 1, 2$
$z$         Cutting score for Fisher discriminant analysis
$q$         Distance variable for LP formulation of discriminant analysis
$q_i$       $k_i \times 1$ vector of distances for documents in group $i = 1, 2$
$1_i$       $k_i \times 1$ column vector of ones, $i = 1, 2$

Once the documents have the above representation for each stock, each document
needs to be classified according to one of two methods. The first method classifies
documents as first appearing a day prior to the stock return increasing relative to the
market or as first appearing a day prior to the stock return decreasing relative to the
market. All the documents that correspond to an increase or no change in the stock's
return are group 1 documents, $\pi_1$, and the documents that correspond to a decrease in
return are group 2 documents, $\pi_2$. The second method classifies documents as first
appearing a day prior to the stock's trading volume increasing as compared to the
previous day or as first appearing a day prior to the stock's trading volume decreasing.
All the documents that correspond to an increase in the stock's trading volume are group
1 documents, $\pi_1$, and the documents that correspond to a decrease or no change in
volume are group 2 documents, $\pi_2$. Hence, all documents are classified as $d_r^i$, with $i = 1$ if
$d_r \in \pi_1$ and $i = 2$ if $d_r \in \pi_2$. Let $k_1$ be the number of group 1 documents so that
$k_2 = k - k_1$ is the number of group 2 documents. Let $d_1, \ldots, d_{k_1}$ be the group 1
documents and $d_{k_1+1}, \ldots, d_k$ be the group 2 documents.
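
The two labeling methods can be sketched in Python as follows. This is a minimal sketch under one stated assumption: a return that "increases relative to the market" is read as the stock's return being at least the market return for the same day. The function names and sample values are hypothetical.

```python
def label_by_return(stock_return, market_return):
    """Group 1 for an increase or no change relative to the market, else group 2."""
    return 1 if stock_return >= market_return else 2


def label_by_volume(volume, previous_volume):
    """Group 1 for an increase in trading volume, else group 2 (decrease or no change)."""
    return 1 if volume > previous_volume else 2


# Hypothetical next-day outcomes for three documents: (stock return, market return).
outcomes = [(0.02, 0.01), (0.01, 0.01), (-0.01, 0.00)]
return_labels = [label_by_return(s, m) for s, m in outcomes]
```

Note that the two methods treat "no change" differently: it falls in group 1 under the return method and in group 2 under the volume method, as described above.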

Discriminant analysis is then used to derive a variate, $w'd_r$, that best
discriminates between the groups. In the classic Fisher approach, the elements of $w$ are
weights that are determined by maximizing the between group variance relative to the
within group variance. The discriminant score, $y_r$, is calculated for each document by
$y_r = w'd_r$. The score is used to predict whether the document is in group 1 or group 2
according to the cutting score $z$ by

$$
\begin{aligned}
w'd_r \ge z &\Rightarrow d_r \in \pi_1 \\
w'd_r < z &\Rightarrow d_r \in \pi_2
\end{aligned}
\qquad (1)
$$

The cutting score, $z$, and the weights, $w$, are determined according to the methods
described in Section 2.3 on discriminant analysis.

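When terms are uncorrelated and orthogonal,

$$t_i' t_j = \delta(i = j).$$

Hence,

$$T = I.$$

The equations in (1) simplify to

$$
\begin{aligned}
w'a_r \ge z &\Rightarrow d_r \in \pi_1 \\
w'a_r < z &\Rightarrow d_r \in \pi_2
\end{aligned}
$$
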
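Fisher's sample linear discriminant function (Johnson and Wichern 1982) is given by

$$y = (\bar{d}_1 - \bar{d}_2)' S_{pooled}^{-1} d$$

where

$$S_{pooled} = \frac{(k_1 - 1)S_1 + (k_2 - 1)S_2}{k - 2},$$

$$\bar{d}_1 = \frac{1}{k_1} \sum_{j=1}^{k_1} d_j, \qquad \bar{d}_2 = \frac{1}{k_2} \sum_{j=k_1+1}^{k} d_j$$

and $S_1$ and $S_2$ are the sample covariance matrices of the two groups. With $T = I$, the function simplifies to

$$y = (\bar{a}_1 - \bar{a}_2)' S_{pooled}^{-1} a,$$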
where  the  cutting  score  is  given  by 


CHAPTER  4 
ENVIRONMENTAL  SCANNING  PROCESS 

The purpose of this chapter is to provide further details of the methodology
developed in this dissertation. The purpose of this dissertation is to automate the process
of environmental scanning on the World Wide Web via the use of web-based methods.
This will be done in a series of steps described in some detail in this chapter. A summary
of these steps is provided in Figure 1. A detailed discussion of each step is given in
Section 4.1. Section 4.2 outlines the plan for testing the data collected.

4.1  Summary 

The  first  step  involves  automating  the  collection  of  web  documents.  A  computer 
program  in  Java  was  written  that  automatically  collects  potentially  relevant  documents 
from  previously  identified  web  sites  given  in  Appendix  B  Table  18.  The  documents  are 
news  articles  pertaining  to  a  predefined  set  of  stocks,  given  in  Appendix  B  Table  19.  The 
program  visits  each  of  these  sites  on  a  daily  basis  and  collects  news  articles  related  to 
specific  stocks  on  the  list  of  stocks.  The  html  documents  are  stored  locally  in  files.  Each 
document  for  each  stock  can  be  classified  into  one  of  two  groups  using  the  current  stock 
return  or  trading  volume  as  the  classification  mechanism.  By  the  first  classification 
mechanism, news documents appearing before stock returns increase (decrease) relative
to the market return are classified as $\pi_1$ ($\pi_2$), where the market return is measured by the
S&P500 Index. Alternatively, news documents appearing before trading volume of a
stock increases (decreases) are classified as $\pi_1$ ($\pi_2$). The intuition is that the contents of the
news documents signal an increase or decrease in stock returns relative to the market or
signal a change in trading volume.


[Figure 1. Scanning process. Flowchart box labels recoverable from the source: Collection; 3a. Prepared Documents for Model Estimation; 4a. Index Term Identification; 4b. Represent Documents with Index Terms; 4c. Represent Documents with Index Terms; 5. Classify Documents as Group 1 or 2; 6. Determine Discriminant Function (Model Estimation); 7. Test Research]

The second step in the procedure involves cleaning the HTML documents. Cleaning
the documents involves removing HTML tags and words in a stopword list (Frakes and
Baeza-Yates 1992). A stopword list consists of words that are very frequent in the
English language and do not have discriminatory meaning, such as the words "the" and
"of." The list of stopwords is given in Appendix B Table 20. As the words in the
stopword list are very frequent in document collections, they do not help distinguish
documents from one another and may be removed from the document collection.
The documents are cleaned by running a Java program. Additionally, documents are
stemmed, that is, words are replaced with their word stems, using the Porter
stemming algorithm (Porter 1980).
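
The cleaning step can be illustrated with a minimal sketch. The dissertation's own programs were written in Java and are not reproduced here; the Python function below, its name `clean_document`, the regular expressions and the tiny stopword set are all assumptions made for illustration (the actual stopword list is the one in Appendix B Table 20). Porter stemming is deliberately left out because it needs a full stemmer implementation.

```python
import re
from html import unescape

# Illustrative stopword set only; the real list is in Appendix B Table 20.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher (assumption)
WORD_RE = re.compile(r"[a-z]+")   # keep alphabetic tokens only (assumption)


def clean_document(html_text, stopwords=STOPWORDS):
    """Strip HTML tags, lowercase the text, tokenize, and drop stopwords."""
    text = unescape(TAG_RE.sub(" ", html_text))
    tokens = WORD_RE.findall(text.lower())
    return [t for t in tokens if t not in stopwords]


sample = "<html><body><p>The price of the stock rose.</p></body></html>"
tokens = clean_document(sample)
```

In a full pipeline each remaining token would then be passed through a Porter stemmer before indexing.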

Once the documents have been cleaned and stemmed, they are indexed. "The
process of constructing document surrogates by assigning identifiers to text items is
known as indexing" (Salton 1989, p. 275). The idea behind indexing is to find the set of
index terms that best represent the documents in the collection. The method of indexing
chosen determines how the documents will be represented in vector format. Indexing is
done in accordance with the algorithm discussed in Section 2.2, using term weights
computed by multiplying term frequency by the inverse document frequency. Indexing is
accomplished by running a series of three Java programs. The first creates a master list
of terms in the entire document collection for each stock. The second computes the
inverse document frequency of each term in the document collection, including only
terms that occur five or more times in the entire document collection. The third
represents each document numerically by counting the term frequency of each distinct
stem in the document and multiplying it by the corresponding inverse document frequency
for that term.
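
The three indexing passes described above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's Java code: the function name `build_index` is hypothetical, and the inverse document frequency is assumed to be log(N / df), since the exact form used in Section 2.2 is not restated in this chapter.

```python
import math
from collections import Counter


def build_index(docs, min_count=5):
    """docs: list of token lists (already cleaned and stemmed).

    Returns (terms, idf, vectors), where vectors[i][j] is the
    tf * idf weight of terms[j] in document i.
    """
    # Pass 1: master term list, keeping terms that occur at least
    # min_count times in the whole collection.
    total = Counter()
    for doc in docs:
        total.update(doc)
    terms = sorted(t for t, c in total.items() if c >= min_count)

    # Pass 2: inverse document frequency, assumed to be log(N / df).
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in terms}

    # Pass 3: represent each document as a vector of tf * idf weights.
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in terms])
    return terms, idf, vectors


docs = [["profit"] * 3 + ["loss"], ["profit"] * 2, ["loss"] * 4 + ["rare"]]
terms, idf, vectors = build_index(docs)
```

In the toy collection, "rare" occurs only once and is therefore excluded from the index, mirroring the five-occurrence threshold.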

After the documents have been indexed, they are in vector format. Next, their
classification, $\pi_1$ or $\pi_2$, will be determined. The first classification method involves
measuring stock returns relative to the market. Let $r_i$ be the rate of return of stock i and
$r_m$ be the rate of return of the market as measured by the S&P 500 Index. A stock's
performance relative to the market is measured by computing the difference between the
stock's rate of return $r_i$ and the market's rate of return $r_m$. News documents appearing a
day before $r_i - r_m > 0$ are classified as $\pi_1$ and documents appearing a day before
$r_i - r_m < 0$ are classified as $\pi_2$.

The second method of classification is based on trading volume. News
documents appearing a day before an increase in a stock's trading volume as compared to
the previous day's trading volume are classified as $\pi_1$ and documents appearing a day
before a decrease or no change in trading volume are classified as $\pi_2$.
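
The two classification mechanisms can be written down compactly. The sketch below is illustrative only; the function names are hypothetical, and the handling of an exact tie in returns is an assumption, since the text defines the groups only for strictly positive and strictly negative differences.

```python
def label_by_return(stock_return, market_return):
    """Group for a document published the day before these returns.

    Returns 1 if r_i - r_m > 0, 2 if r_i - r_m < 0, and None for an
    exact tie (the text does not say how ties are handled).
    """
    diff = stock_return - market_return
    if diff > 0:
        return 1
    if diff < 0:
        return 2
    return None


def label_by_volume(next_day_volume, current_day_volume):
    """Group 1 if volume rises the next day; group 2 if it falls or is unchanged."""
    return 1 if next_day_volume > current_day_volume else 2
```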

After classification, each document is in a form that can be analyzed via
discriminant analysis. The set of independent variables for the discriminant procedure is
given by the set of index terms, and the dependent variable is the classification based
on either return or trading volume. Each document is an observation in the discriminant
procedure. The vector representation and discriminant analysis were discussed in Section
3.4. Discriminant analysis is performed in two steps. First, a forward stepwise
discriminant procedure is employed; the selection criterion for a variable to enter the
model is the significance level, set at 15%, of an F test from an analysis of covariance.
Next, Fisher's (1936) linear discriminant model is determined by calculating the
linear discriminant coefficients, w, using the variables that entered the model via the
stepwise procedure, and calculating the cutting score, z.
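
To make the second step concrete, the following sketch computes Fisher's linear discriminant for two groups in the simplest possible setting. It is not the procedure used in the dissertation: it assumes only two predictor variables (so the pooled covariance matrix can be inverted in closed form), omits the stepwise selection entirely, and takes the cutting score z to be the midpoint of the projected group means, which is an assumption appropriate for equal priors. All function names are hypothetical.

```python
def mean_vec(rows):
    """Column means of a list of 2-element observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 sum-of-squares-and-cross-products matrix around mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fisher_lda(group1, group2):
    """Return (w, z): discriminant coefficients and midpoint cutting score."""
    m1, m2 = mean_vec(group1), mean_vec(group2)
    s1, s2 = scatter(group1, m1), scatter(group2, m2)
    dof = len(group1) + len(group2) - 2
    # Pooled within-group covariance matrix.
    sp = [[(s1[a][b] + s2[a][b]) / dof for b in range(2)] for a in range(2)]
    det = sp[0][0] * sp[1][1] - sp[0][1] * sp[1][0]
    inv = [[sp[1][1] / det, -sp[0][1] / det],
           [-sp[1][0] / det, sp[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    w = [inv[0][0] * d[0] + inv[0][1] * d[1],
         inv[1][0] * d[0] + inv[1][1] * d[1]]
    # Cutting score: midpoint of the two projected group means (assumption).
    z = 0.5 * (w[0] * (m1[0] + m2[0]) + w[1] * (m1[1] + m2[1]))
    return w, z


def classify(x, w, z):
    """Assign group 1 if the discriminant score exceeds the cutting score."""
    return 1 if w[0] * x[0] + w[1] * x[1] > z else 2


# Made-up observations for illustration only.
g1 = [[2.0, 1.0], [3.0, 1.5], [2.5, 0.5], [3.5, 1.0]]
g2 = [[0.0, 1.0], [0.5, 0.5], [1.0, 1.5], [0.5, 1.0]]
w, z = fisher_lda(g1, g2)
```

With many index terms the same computation requires a general matrix solver rather than the 2x2 closed form used here.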

Once  the  discriminant  function  is  determined,  the  validity  of  the  discriminant 
model  will  be  checked  via  a  holdout  sample.  Documents  in  the  holdout  sample  will  be 
used  to  predict  whether  stock  returns  will  increase  or  decrease  relative  to  the  market  or 
whether  trading  volume  will  increase  or  decrease.  The  values  of  z  and  w  computed  in 
the  discriminant  procedure  will  be  updated  periodically  according  to  the  newly  collected 
data.  The  plan  for  testing  the  process  discussed  in  this  section  is  given  in  Section  4.2. 

4.2  Experimental  Plan 

The document collection is divided into an 80% training sample and a 20%
holdout sample: the first 80% of the documents, in order of collection time, form the
training sample and the remaining 20% form the holdout sample. Classification matrices
for the training and the holdout samples are examined. To assess the external validity of
the model, the statistical significance of the holdout sample classification matrix is
determined via Press's Q, using a chi-square distribution with one degree of freedom for
two-group classification. Press's Q is given by

$$\text{Press's } Q = \frac{[N - (nK)]^2}{N(K - 1)}$$

where N is the total sample size, n is the number of correctly classified observations and
K is the number of groups (Hair et al. 1998).
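
As a worked illustration of Press's Q, the sketch below evaluates the statistic and compares it with standard chi-square critical values for one degree of freedom. The function names are hypothetical, and the example count of 125 correct classifications out of 221 holdout documents is invented for illustration, not a result from this study.

```python
# Upper-tail critical values of the chi-square distribution, 1 degree of freedom.
CHI2_CRIT_1DF = {0.10: 2.706, 0.05: 3.841, 0.01: 6.635}


def press_q(total, correct, groups=2):
    """Press's Q = [N - nK]^2 / [N (K - 1)]."""
    return (total - correct * groups) ** 2 / (total * (groups - 1))


def is_significant(total, correct, alpha=0.10):
    """Two-group case: compare Q with the chi-square(1) critical value."""
    return press_q(total, correct) > CHI2_CRIT_1DF[alpha]


q = press_q(221, 125)  # hypothetical: 125 of 221 holdout documents correct
```

Here Q = 841 / 221, roughly 3.81, which exceeds 2.706 but not 3.841, so this hypothetical matrix would be significant at 10% but not at 5%. Because the numerator is squared, Q is also large when far fewer than half the documents are classified correctly, so the hit ratio should be checked alongside it.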

The internal validity of the linear discriminant model is checked by examining the
classification matrix for the training set and by examining the classification matrix
determined via a jackknife cross-validation procedure. The proportional chance criterion
(Hair et al. 1998) is calculated for both classification matrices. The proportional chance
criterion is $C_{pro} = p^2 + (1 - p)^2$, where p is the proportion of group 1 documents in the
training set. The proportional chance criterion is compared to the hit ratio, $p_h$, defined
as the proportion of documents that are classified correctly by the discriminant model. If
the hit ratio is statistically significantly larger than the proportional chance criterion
according to the z-statistic

$$z = \frac{p_h - C_{pro}}{\sqrt{C_{pro}(1 - C_{pro})/N}}$$

where N is the number of documents classified,
then the classification by the jackknife cross-validation is statistically better than chance.
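
The comparison of the hit ratio with the proportional chance criterion can be sketched as below. The z-statistic is assumed to take the usual one-sample form, (hit ratio minus chance) divided by the square root of chance times (1 minus chance) over N; this form reproduces the AAPL entry of Table 8 (671 group 1 documents out of 1356, hit ratio 90.93%, z of about 30.14). Function names are hypothetical.

```python
import math


def proportional_chance(n_group1, n_total):
    """Proportional chance criterion p^2 + (1 - p)^2."""
    p = n_group1 / n_total
    return p ** 2 + (1 - p) ** 2


def chance_z(hit_ratio, c_pro, n_total):
    """z-statistic for the hit ratio against the chance criterion (assumed form)."""
    return (hit_ratio - c_pro) / math.sqrt(c_pro * (1 - c_pro) / n_total)


c_pro = proportional_chance(671, 1356)   # AAPL training set, Table 7
z_stat = chance_z(0.9093, c_pro, 1356)   # AAPL hit ratio, Table 8
```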

Testing the training set and the cross-validation classification matrices via the
proportional chance criterion addresses research question one (RQ1) given in Section 3.2.
Testing the holdout sample addresses research question two (RQ2).


Table 5. Summary of experimental plan

1. Check 80% training sample for validity via classification matrix and leave-one-out
classification

2. Check 20% holdout sample for validity via classification matrix

3. Compare the performance of the two classification mechanisms

4. Check 20% holdout sample results for improvement with independent variables
augmented with $x_1$


In addition, as a benchmark, classification of the documents is done randomly, as
opposed to using actual daily changes in stock returns or trading volume for determining
group membership. Performance of the model on random classification is compared to
performance using actual classification mechanisms to determine if the actual
classification performs statistically better than random classification.

Finally, an independent variable, $x_1$, is added to the set of independent variables.
The variable represents the stock's performance in the three days prior to the date used
for classification. The variable $x_1$ is calculated by analyzing the classification of the
stock based on return or trading volume over these three days. Table 6 summarizes the
values of $x_1$. Group membership is assessed for each of these three days, resulting in eight
possible combinations. To compute the value of $x_1$, group membership is coded in
binary: group one corresponds to a one and group two to a zero. The value of $x_1$ is given
by

$$x_1 = x_1^1(2^2) + x_1^2(2^1) + x_1^3(2^0),$$

where $x_1^1$, $x_1^2$ and $x_1^3$ are the binary codes for one, two and three days prior,
respectively. An advantage of using this method of calculating $x_1$ is that more weight is
given to the most recent day's classification.
Table 6. Summary of values of $x_1$

One Day Prior    Two Days Prior    Three Days Prior    Value of
$x_1^1$          $x_1^2$           $x_1^3$             $x_1$
1                1                 1                   7
1                1                 2                   6
1                2                 1                   5
1                2                 2                   4
2                1                 1                   3
2                1                 2                   2
2                2                 1                   1
2                2                 2                   0
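
The binary coding behind Table 6 can be checked with a few lines of code. The function name is hypothetical; the logic simply maps group 1 to a one and group 2 to a zero and weights the three prior days by 4, 2 and 1, as in the table.

```python
def prior_classification_value(one_day, two_days, three_days):
    """Combine three prior-day group memberships (1 or 2) into a value from 0 to 7.

    Group 1 is coded as 1 and group 2 as 0; the most recent day
    carries the largest weight (2^2), the oldest the smallest (2^0).
    """
    bits = [1 if g == 1 else 0 for g in (one_day, two_days, three_days)]
    return bits[0] * 4 + bits[1] * 2 + bits[2] * 1


# Reproduce every row of Table 6.
table6 = {(a, b, c): prior_classification_value(a, b, c)
          for a in (1, 2) for b in (1, 2) for c in (1, 2)}
```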

In  light  of  the  experimental  plan,  two  additional  research  questions  are: 
RQ3:  Which  method  of  classification  works  best,  the  stock  return  method  or  the  method 
based  on  trading  volume  changes? 

RQ4: Do holdout sample results improve when adding $x_1$ as an independent variable?

A summary of the experimental plan is given in Table 5. Each numbered item in Table 5
addresses the same numbered research question. The results of the plan are given in
Chapter 5.


CHAPTER  5 
RESULTS 

The purpose of this chapter is to discuss the results of the data analysis outlined in
Section 4.2. There were 186 stocks in the original data set listed in Appendix B Table 19.
Stocks with 200 articles or fewer were not considered, as they did not average more
than one article per day. This leaves a sample of 96 stocks in the analysis.
Additionally, stocks that did not trade on all days on which data were collected were
removed from the list of stocks for analysis, leaving 93 stocks.

The  chapter  starts  with  an  overview  of  the  document  collections  for  each  stock  in 
Section  5.1.  Section  5.2  provides  results  for  the  80%  training  set.  Section  5.3  provides 
results  for  the  20%  holdout  sample.  Section  5.4  provides  results  for  the  case  where  the 
set  of  independent  variables  is  augmented  with  a  variable  representing  the  stock's 
performance  in  the  previous  three  days. 

5.1  Summary  of  Document  Collections 
Table 7 provides a summary of the size of the training set (training no. of docs),
the size of the holdout set of documents (holdout no. of docs) and the number of word
stems remaining after indexing (no. of stems). The average training set size is 974, the
average holdout sample size is 195 and the average number of word stems is 5217. Table 7
also provides the number of group 1 documents in the training set (training group 1),
the number of group 2 documents in the training set (training group 2), the number of
group 1 documents in the holdout sample (holdout group 1) and the number of group 2
documents in the holdout sample (holdout group 2).

Recall  that  the  maximum  number  of  entering  variables  in  the  step-wise 
discriminant  procedure  is  the  total  number  of  documents  in  the  collection  for  a  given 
stock  divided  by  four.  When  using  classification  based  on  return,  39  of  the  stocks 
entered  the  maximum  number  of  variables  allowed.  When  using  classification  based  on 
volume,  41  of  the  stocks  entered  the  maximum  number  of  variables  allowed. 

5.2  Training  Set  Results 

Table 8 provides the classification matrix for the training set of documents for the
stocks with classification based on stock returns. The average hit ratio, defined as the
percent of correctly classified documents in the collection, is 92.11% for Table 8. Using
the z-statistic based on the proportional chance criterion defined in Section 4.2, every
stock has a classification matrix with correct classification that is significant at 1%.

Table 9 provides the jackknife cross-validation classification matrix for the
training set of documents for stocks with classification based on stock returns. The
average hit ratio is 89.06% for Table 9. The z-statistic for the jackknife cross-validation
classification matrix for every stock is significant at 1%.

Table 10 provides the classification matrix for the training set of documents for
the stocks with classification based on changes in volume. The average hit ratio is
91.68% for Table 10. Using the proportional chance criterion, every stock has a
classification matrix with correct classification that is significant at 1%.

Table 11 provides the jackknife cross-validation classification matrix for the
training set of documents for stocks with classification based on changes in volume. The
average hit ratio is 88.68% for Table 11. Using the proportional chance criterion, every
stock has a leave-one-out cross-validation classification matrix with correct classification
that is significant at 1%.

5.3  Holdout  Sample  Results 

Table 12 provides a summary of the results of classification based on return for
the 20% holdout sample. The average hit ratio is 50.25% for Table 12. Using Press's Q,
16 out of 93 stocks have a holdout classification matrix with correct classification that is
significant at 10%. As a benchmark, the documents of a subset of 71 of the stocks were
classified randomly, with the classification proportional to the number of documents in
each group. With random classification, only 5 stocks out of 71 had a holdout
classification matrix with correct classification significant at 10%. With the actual
classification, 13 out of these 71 stocks had a holdout classification matrix with correct
classification significant at 10%. The p-value for the difference between the two
proportions, 13 out of 71 and 5 out of 71, for a one-tailed test is 0.0217.
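
The text does not name the test behind this p-value. One plausible reading, sketched below purely as an assumption, is a pooled two-proportion z-test with a one-tailed alternative; the function name is hypothetical. For 13 of 71 against 5 of 71 it gives z of about 2.02 and a p-value of roughly 0.022, in line with the reported 0.0217.

```python
import math


def one_tailed_two_proportion_p(x1, n1, x2, n2):
    """One-tailed p-value for H1: p1 > p2 using a pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


p_value = one_tailed_two_proportion_p(13, 71, 5, 71)
```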

Table 13 provides a summary of the results of classification based on volume for
the 20% holdout sample. The average hit ratio is 49.67% for Table 13. Using Press's Q,
16 out of 93 stocks have a holdout classification matrix with correct classification that is
significant at 10%. As a benchmark, the documents of a subset of the stocks were
classified randomly, with the classification proportional to the number of documents in
each group. With random classification, only 5 stocks out of 72 had a holdout
classification matrix with correct classification significant at 10%. With the actual
classification, 13 out of these 72 stocks had a holdout classification matrix with correct
classification significant at 10%. The p-value for the difference between the two
proportions, 13 out of 72 and 5 out of 72, for a one-tailed test is 0.0217.

5.4 Prior Classification Variable Results

Table 14 provides the classification matrix for the training set of documents for
the stocks with classification based on stock returns with the set of independent variables
augmented with $x_1$, a variable representing the classification of the stock for the three
days prior to the date used for classification of the dependent variable. Only stocks that
had $x_1$ enter the model during the stepwise discriminant procedure, a collection of 82
stocks, are included in Table 14. The average hit ratio, defined as the percent of correctly
classified documents in the collection, is 93.09% for Table 14. Using the z-statistic based
on the proportional chance criterion defined in Section 4.2, every stock has a
classification matrix with correct classification that is significant at 1%. Additionally, the
average step number at which $x_1$ entered the model in the stepwise discriminant
procedure is 41.48.

Table 15 provides the classification matrix for the training set of documents for
the stocks with classification based on changes in trading volume with the set of
independent variables augmented with $x_1$. Only stocks that had $x_1$ enter the model
during the stepwise discriminant procedure, a collection of 91 stocks, are included in
Table 15. The average hit ratio, defined as the percent of correctly classified documents in
the collection, is 94.28% for Table 15. Using the z-statistic based on the proportional
chance criterion defined in Section 4.2, every stock has a classification matrix with
correct classification that is significant at 1%. Additionally, the average step number at
which $x_1$ entered the model in the stepwise discriminant procedure is 4.12.

Table 16 provides a summary of the results of classification based on return with
the set of independent variables augmented with $x_1$ for the 20% holdout sample. The
average hit ratio is 49.91% for Table 16. Using Press's Q, 15 out of 82 stocks have a
holdout classification matrix with correct classification that is significant at 10%.

Table 17 provides a summary of the results of classification based on volume with
the set of independent variables augmented with $x_1$ for the 20% holdout sample. The
average hit ratio is 52.92% for Table 17. Using Press's Q, 32 out of 91 stocks have a
holdout classification matrix with correct classification that is significant at 10%.

Table 7. Return classification collection summary


                          Training     Training Training Holdout      Holdout      Holdout
Stock        No. of Stems No. of Docs  Group 1  Group 2  No. of Docs  Group 1      Group 2
AAPL         [illegible]  1356         671      685      221          151          70
ABS          [illegible]  548          288      260      135          75           60
ADSX         2133         238          106      132      114          30           84
AEOS         3179         422          227      195      73           52           21
AET          [illegible]  724          426      298      161          75           86
AFFX         [illegible]  245          142      103      29           21           8
ANF          [illegible]  474          305      169      60           41           19
AOL          [illegible]  2623         1156     1467     570          308          262
ASKJ         [illegible]  329          144      185      59           37           22
AXP          [illegible]  1543         884      659      368          168          200
BA           [illegible]  2557         1344     1213     577          312          265
BAB          [illegible]  1028         495      533      233          87           146
BC           [illegible]  214          127      87       56           12           44
BDK          [illegible]  296          160      136      87           30           57
BMY          [illegible]  1762         869      893      444          238          206
BUD          4312         709          336      373      114          72           42
C            [illegible]  2257         1209     1048     496          241          255
CCU          4357         594          293      301      109          57           52
CHIR         [illegible]  447          241      206      52           34           18
CMTN         [illegible]  220*         110      110      25           14           11
COX          [illegible]  1097         565      532      114          77           37
CREE         [illegible]  226          89       137      98           41           57
CSCO         [illegible]  2171         1102     1069     389          167          222
CVC          5254         704          354      350      86           33           53
DAL          [illegible]  2158         1122     1036     427          207          220
DD           [illegible]  1122         633      489      232          135          97
DELL         [illegible]  1717*        903      814      312          185          127
DOW          [illegible]  849          445      404      96           72           24
DRI          [illegible]  287          199      88       77           42           35
EBAY         [illegible]  1298         618      680      213          131          82
ELY          [illegible]  472          234      238      81           52           29
ERICY        [illegible]  1330         769      561      290          160          130
F            9589         2795         1314     1481     543          278          265
FDX          6215         1055         625      430      245          110          135
FPL          1601         253          133      120      110          73           37
GD           [illegible]  1149         581      568      251          126          125
GE           10249        2413         1100     1313     520          305          215
GLW          7275         1208         516      692      231          61           170
GM           10002        2744         1384     1360     596          352          244
GPS          7483         1155         446      709      158          101          57
GR           3549         539          252      287      80           28           52

[illegible]: value could not be recovered from the source scan. Stock symbols that are
garbled in the scan follow the stock order of Table 8.
*: total not legible in the source scan; computed as Group 1 + Group 2.

Table 7. Continued


                          Training     Training Training Holdout      Holdout      Holdout
Stock        No. of Stems No. of Docs  Group 1  Group 2  No. of Docs  Group 1      Group 2
GTW          [illegible]  1051         463      588      182          91           91
HDI          3233         333          156      177      57           49           8
HUM          [illegible]  407          249      158      36           14           22
IBM          [illegible]  2426         1028     1398     346          209          137
ILXO         [illegible]  206          93       113      17           4            13
INTC         [illegible]  2359         1311     1048     527          207          320
JNJ          [illegible]  1370         693      677      344          176          168
KM           [illegible]  1628         795      833      472          258          214
KR           [illegible]  559          294      265      159          87           72
LIZ          [illegible]  252          147      105      64           19           45
LLTC         [illegible]  299          133      166      43           37           6
LLY          [illegible]  1396         766      630      244          124          120
LMT          [illegible]  1550         791      759      399          159          240
LUV          [illegible]  1169         672*     497      209          94           115
MCD          [illegible]  1406         720      686      278          135          143
MO           [illegible]  1514         831      683      399          203          196
MON          [illegible]  610          339      271      190          94           96
MOT          [illegible]  2090         949      1141     379          171          208
MRK          [illegible]  1282         673      609      377          180          197
MYG          [illegible]  809          413      396      85           45           40
NKE          [illegible]  415          212      203      89           21           68
NOK          [illegible]  1053*        548      505      210          124          86
NSANY        [illegible]  432          240      192      140          49           91
NVS          [illegible]  722*         366      356      182          106          76
NXTL         [illegible]  831          425      406      146          89           57
ORCL         [illegible]  1183         597      586      226          174          52
PBG          [illegible]  315          179      136      63           47           16
PD           [illegible]  294          198      96       67           24           43
PEP          [illegible]  932          382      550      157          88           69
PHA          [illegible]  737*         393      344      205          121          84
RBK          [illegible]  321          144      177      59           37           22
RL           2073         204          117      87       19           13           6
ROST         [illegible]  216          84       132      17           9            8
S            6390         1008         661      347*     [illegible]  [illegible]  116
SNE          5762         1006         563      443      199          95           104
SO           2238         273          104      169      90           36           54
SOI          2546         269          176      93       28           15           13
TGT          6759         1011         624      387      138          56           82
TM           3673         732          365      367      163          69           94

[illegible]: value could not be recovered from the source scan. Stock symbols that are
garbled in the scan follow the stock order of Table 8.
*: value not legible in the source scan; computed from the other two entries of the same
sample (No. of Docs = Group 1 + Group 2).


Table  7.  Continued 


                          Training     Training Training Holdout      Holdout      Holdout
Stock        No. of Stems No. of Docs  Group 1  Group 2  No. of Docs  Group 1      Group 2
TMPW         [illegible]  670*         323      347      60           33           27
TOM          [illegible]  225*         101      124      16           5            11
TXN          [illegible]  977*         515      462      226          135          91
TXU          [illegible]  480          202      278      60           47           13
USAI         [illegible]  707*         325      382      108          30           78
[illegible]  [illegible]  428          165      263      90           30           60
WHR          [illegible]  424          229      195      86           45           41
WIN          2044         228          130      98       [illegible]  97           41
WMT          8023         1670         904      766      319          132          187
WPPGY        2624         349          185      164      69           24           45
X            3158         549          357      192      51           29           22
XEL          2290         277          133      144      131          100          31
XOM          7610         1473         722      751      277          121          156

[illegible]: value could not be recovered from the source scan. Stock symbols that are
garbled in the scan follow the stock order of Table 8.
*: total not legible in the source scan; computed as Group 1 + Group 2.

Table 8. Return classification training set results

Stock        Hit Ratio    Chance  Z-Statistic  Significant
AAPL         90.93%       0.50    30.14        yes
ABS          94.53%       0.50    20.79        yes
ADSX         88.66%       0.51    11.74        yes
AEOS         86.26%       0.50    14.78        yes
AET          92.96%       0.52    22.29        yes
AFFX         95.92%       0.51    13.98        yes
ANF          89.66%       0.54    15.53        yes
AOL          97.18%       0.51    47.61        yes
ASKJ         90.88%       0.51    14.55        yes
AXP          94.10%       0.51    33.82        yes
BA           97.11%       0.50    47.51        yes
BAB          96.30%       0.50    29.65        yes
BC           91.59%       0.52    11.66        yes
BDK          85.47%       0.50    12.09        yes
BMY          94.89%       0.50    37.68        yes
BUD          83.36%       0.50    17.69        yes
C            95.30%       0.50    42.80        yes
CCU          90.91%       0.50    19.94        yes
CHIR         93.29%       0.50    18.18        yes
CMTN         87.73%       0.50    11.19        yes
COX          89.61%       0.50    26.21        yes
CREE         95.58%       0.52    13.04        yes
CSCO         95.72%       0.50    42.59        yes
CVC          88.49%       0.50    20.43        yes
DAL          95.92%       0.50    42.59        yes
DD           90.37%       0.51    26.50        yes
DELL         93.83%       0.50    36.21        yes
DOW          91.99%       0.50    24.40        yes
DRI          90.24%       0.57    11.23        yes
EBAY         90.91%       0.50    29.40        yes
ELY          92.58%       0.50    18.50        yes
ERICY        92.93%       0.51    30.43        yes
F            94.06%       0.50    46.40        yes
FDX          91.37%       0.52    25.78        yes
FPL          96.84%       0.50    14.86        yes
GD           92.08%       0.50    28.52        yes
GE           93.37%       0.50    42.23        yes
GLW          93.96%       0.51    29.82        yes
GM           95.41%       0.50    47.57        yes
GPS          90.82%       0.53    26.02        yes
GR           89.05%       0.50    18.04        yes
GTW          92.29%       0.51    26.97        yes
HDI          85.89%       0.50    13.02        yes

Table 8. Continued


Stock        Hit Ratio    Chance  Z-Statistic  Significant
HUM          91.40%       0.52    15.72        yes
IBM          94.77%       0.51    42.96        yes
ILXO         92.23%       0.50    11.99        yes
INTC         95.13%       0.51    43.23        yes
JNJ          94.16%       0.50    32.69        yes
KM           92.51%       0.50    34.28        yes
KR           82.47%       0.50    15.29        yes
LIZ          90.08%       0.51    12.29        yes
LLTC         85.95%       0.51    12.22        yes
LLY          94.48%       0.50    32.89        yes
LMT          95.87%       0.50    36.10        yes
LUV          91.62%       0.51    27.70        yes
MCD          92.18%       0.50    31.61        yes
MO           93.92%       0.50    33.81        yes
MON          92.30%       0.51    20.59        yes
MOT          94.40%       0.50    40.21        yes
MRK          94.31%       0.50    31.64        yes
MYG          85.78%       0.50    20.34        yes
NKE          89.16%       0.50    15.94        yes
NOK          95.06%       0.50    29.19        yes
NSANY        95.14%       0.51    18.51        yes
NVS          96.95%       0.50    25.23        yes
NXTL         91.46%       0.50    23.89        yes
ORCL         96.11%       0.50    31.72        yes
PBG          83.17%       0.51    11.45        yes
PD           97.62%       0.56    14.37        yes
PEP          89.27%       0.52    23.00        yes
PHA          94.84%       0.50    24.23        yes
RBK          84.42%       0.51    12.15        yes
RL           85.78%       0.51    9.92         yes
ROST         81.02%       0.52    8.40         yes
S            91.96%       0.55    23.68        yes
SNE          96.32%       0.51    28.94        yes
SO           95.97%       0.53    14.28        yes
SOI          96.28%       0.55    13.68        yes
TGT          91.49%       0.53    24.68        yes
TM           96.72%       0.50    25.28        yes
TMPW         93.43%       0.50    22.45        yes
TOM          90.22%       0.51    11.91        yes
TXN          95.80%       0.50    28.54        yes
TXU          96.04%       0.51    19.63        yes
USAI         92.64%       0.50    22.51        yes

Table 8. Continued


Stock        Hit Ratio    Chance  Z-Statistic  Significant
[illegible]  86.68%*      0.53    14.11        yes
WHR          92.69%       0.50    17.45        yes
WIN          88.60%*      0.51*   11.36*       yes
WMT          95.57%       0.50    36.97        yes
WPPGY        97.99%       0.50    17.86        yes
X            89.25%       0.55    16.34        yes
XEL          90.25%       0.50    13.37        yes
XOM          94.37%       0.50    34.04        yes

[illegible]: value could not be recovered from the source scan.
*: partly illegible in the source scan; the reading shown matches the legible characters
and is consistent with the document counts in Table 7 and the z-statistic of Section 4.2.

Table 9. Return classification training set leave-one-out cross-validation

Stock        Hit Ratio    Chance  Z-Statistic  Significant
AAPL         87.39%       0.50    27.53        yes
ABS          90.51%       0.50    18.91        yes
ADSX         82.77%       0.51    9.93         yes
AEOS         83.89%       0.50    13.80        yes
AET          90.88%       0.52    21.17        yes
AFFX         93.47%       0.51    13.22        yes
ANF          86.50%       0.54    14.15        yes
AOL          94.97%       0.51    45.35        yes
ASKJ         86.32%       0.51    12.90        yes
AXP          91.12%       0.51    31.48        yes
BA           94.80%       0.50    45.17        yes
BAB          93.58%       0.50    27.90        yes
BC           89.72%       0.52    11.12        yes
BDK          [illegible]  0.50    9.19         yes
BMY          92.96%       0.50    36.06        yes
BUD          81.10%       0.50    16.49        yes
C            92.25%       0.50    39.90        yes
CCU          89.06%       0.50    19.03        yes
CHIR         89.71%       0.50    16.66        yes
CMTN         85.45%       0.50    10.52        yes
COX          87.51%       0.50    24.82        yes
CREE         92.48%       0.52    12.11        yes
CSCO         93.74%       0.50    40.75        yes
CVC          84.52%       0.50    18.32        yes
DAL          92.96%       0.50    39.84        yes
DD           87.52%       0.51    24.59        yes
DELL         90.74%       0.50    33.65        yes
DOW          89.52%       0.50    22.96        yes
DRI          87.46%       0.57    10.27        yes
EBAY         87.83%       0.50    27.17        yes
ELY          88.56%       0.50    16.75        yes
ERICY        89.32%       0.51    27.80        yes
F            91.20%       0.50    43.37        yes
FDX          88.82%       0.52    24.12        yes
FPL          92.09%       0.50    13.35        yes
GD           89.12%       0.50    26.52        yes
GE           90.14%       0.50    39.05        yes
GLW          91.06%       0.51    27.81        yes
GM           93.00%       0.50    45.05        yes
GPS          87.97%       0.53    24.08        yes
GR           87.57%       0.50    17.35        yes
GTW          90.10%       0.51    25.55        yes
HDI          83.18%       0.50    12.04        yes

Table 9. Continued

SIOCK 

rllt  Ratio 

Chance 

Z-Statistic 

TTT  TA /T 
HUM 

Tin/ 

87.71% 

0.52 

14.23 

LdM 

91.59% 

0.51 

39.84 

TT  V/^ 

86.41% 

0.50 

10.32 

IJN  1  L 

92.20% 

0.51 

40.39 

r\TT 
J  JNJ 

90.80% 

0.50 

30.20 

KJVl 

on  1  AO/ 

89.19% 

0.50 

31.60 

VD 

TO  1  on/ 

78.18% 

0.50 

13.26 

LIZ 

O  A  A^n/ 

84.92% 

0.51 

10.65 

TO     /'n / 

78.26% 

0.51 

9.56 

r  T  V 
LLY 

A'^     ^  n  / 

92.62% 

0.50 

31.50 

T  A/TT 
LMl 

94.32% 

0.50 

34.88 

T  T  TA^ 
LU  V 

AA  1  z'n/ 

90. 1 6% 

0.51 

26.70 

o  o  c  cn/ 

88.55% 

0.50 

28.89 

MU 

AA  O'^n/ 

90.82% 

0.50 

31.40 

MUiN 

o o  o cn / 

88.85% 

0.51 

18.89 

MUl 

A  A  A  1  n  / 

90.91% 

0.50 

37.02 

\/fr>T/' 
IVIKK 

A'^  T  c  n  / 

92.75% 

0.50 

30.52 

MYCj 

77.00% 

0.50 

11.13 

KT1/"C 

NKb 

84.82% 

0.50 

14.18 

IN  UK 

A^  o n / 

93.83% 

0.50 

28.39 

JNoAiN  Y 

A*^  T cn / 

93.75% 

0.51 

17.93 

IN  Vb 

AC   y(    n  / 

95.43% 

0.50 

24.41 

MAIL 

OA  Acn/ 

89.05% 

0.50 

22.50 

UKL-L 

A  /I    ZTTn / 

94.67% 

0.50 

30.73 

rrJCjf 

o  ^  o    n  / 

82.22% 

0.51 

11.11 

rL> 

AzT  /^An/ 

96.60% 

0.56 

14.02 

rLr 

87.02% 

0.52 

21.62 

rxlA 

A'^      An  / 

92.94% 

0.50 

23.20 

TA  ^cn/ 

79.75% 

0.51 

10.47 

DT 

KL 

OA  o on / 

80.88% 

0.51 

8.51 

KUb  1 

TT  o  1  n  / 

77.31% 

0.52 

7.31 

c 
o 

O  T    ^  /\f\  / 

87.40% 

0.55 

20.77 

CXTC? 

94.63% 

0.51 

27.86 

ISO 

91.94% 

0.53 

12.94 

SOI 

92.19% 

0.55 

12.34 

TGT 

87.44% 

0.53 

22.09 

I  IVl 

QC  T)0/ 

yj.ZZ/o 

A   C  A 

0.50 

24.47 

TMPW 

90.15% 

0.50 

20.75 

TOM 

87.56% 

0.51 

11.11 

TXN 

93.76% 

0.50 

27.26 

TXU 

94.38% 

0.51 

18.90 

USAI  1 

89.25% 

0.50 

20.70 

Significant 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
yes 
Table 9. Continued

Stock    Hit Ratio    Chance    Z-Statistic    Significant
WEN      [illegible]  0.53      11.89          yes
WHR      [illegible]  0.50      16.09          yes
WIN      [illegible]  0.51      [illegible]    yes
WMT      93.35%       0.50      35.15          yes
WPPGY    96.28%       0.50      17.22          yes
X        86.89%       0.55      15.23          yes
XEL      88.81%       0.50      12.89          yes
XOM      92.80%       0.50      32.84          yes
Table 10. Volume classification training set results

Stock    Hit Ratio    Chance    Z-Statistic    Significant
AAPL     89.53%       0.50      28.79          yes
ABS      94.11%       0.50      20.49          yes
ADSX     90.21%       0.56      10.66          yes
AEOS     83.85%       0.50      13.86          yes
AET      93.14%       0.51      22.65          yes
AFFX     94.21%       0.51      13.55          yes
ANF      88.98%       0.51      16.54          yes
AOL      96.49%       0.50      47.05          yes
ASKJ     86.89%       0.50      13.35          yes
AXP      94.12%       0.52      32.59          yes
BA       96.83%       0.50      46.71          yes
BAB      96.75%       0.50      29.73          yes
BC       89.15%       0.51      11.22          yes
BDK      84.12%       0.51      11.52          yes
BMY      93.87%       0.50      36.26          yes
BUD      86.23%       0.50      19.11          yes
C        94.45%       0.50      41.20          yes
CCU      93.32%       0.50      20.92          yes
CHIR     94.83%       0.51      18.62          yes
CMTN     80.00%       0.50      8.80           yes
COX      89.74%       0.50      26.12          yes
CREE     96.38%       0.50      13.78          yes
CSCO     95.29%       0.50      41.71          yes
CVC      88.24%       0.51      19.82          yes
DAL      96.07%       0.51      41.19          yes
DD       89.50%       0.50      26.35          yes
DELL     94.36%       0.50      36.04          yes
DOW      90.28%       0.51      22.48          yes
DRI      86.06%       0.53      11.09          yes
EBAY     90.81%       0.50      29.25          yes
ELY      87.15%       0.51      15.84          yes
ERICY    93.06%       0.50      31.18          yes
F        94.81%       0.50      46.84          yes
FDX      91.15%       0.50      26.42          yes
FPL      95.34%       0.51      13.77          yes
GD       94.23%       0.52      28.81          yes
GE       94.96%       0.50      43.27          yes
GLW      92.64%       0.51      [illegible]    yes
GM       95.46%       0.51      46.34          yes
GPS      87.48%       0.50      25.23          yes
GR       89.89%       0.50      18.43          yes
GTW      91.38%       0.51      25.79          yes
HDI      89.06%       0.51      13.89          yes
Table 10. Continued

Stock    Hit Ratio    Chance    Z-Statistic    Significant
HUM      90.64%       0.50      16.37          yes
IBM      95.64%       0.50      43.50          yes
ILXO     91.22%       0.51      11.65          yes
INTC     94.38%       0.50      42.59          yes
JNJ      93.72%       0.51      31.65          yes
KM       93.80%       0.50      34.79          yes
KR       82.82%       0.52      14.29          yes
LIZ      89.26%       0.50      12.21          yes
LLTC     88.89%       0.50      13.40          yes
LLY      92.74%       0.50      31.69          yes
LMT      95.16%       0.51      34.69          yes
LUV      92.33%       0.50      28.66          yes
MCD      93.61%       0.50      31.86          yes
MO       92.09%       0.51      31.52          yes
MON      92.42%       0.50      20.80          yes
MOT      94.90%       0.50      40.70          yes
MRK      94.84%       0.52      30.55          yes
MYG      87.36%       0.52      20.08          yes
NKE      85.71%       0.50      14.51          yes
NOK      94.97%       0.51      28.46          yes
NSANY    97.18%       0.50      19.44          yes
NVS      96.39%       0.50      24.84          yes
NXTL     91.39%       0.52      22.36          yes
ORCL     97.35%       0.50      32.35          yes
PBG      82.37%       0.50      11.33          yes
PD       97.61%       0.55      14.58          yes
PEP      89.88%       0.50      24.18          yes
PHA      93.99%       0.51      23.18          yes
RBK      84.03%       0.50      12.04          yes
RL       83.92%       0.52      9.09           yes
ROST     80.37%       0.54      7.76           yes
S        90.61%       0.50      25.69          yes
SNE      96.79%       0.50      29.47          yes
SO       93.75%       0.50      14.43          yes
SOI      96.24%       0.52      14.32          yes
TGT      88.21%       0.51      23.72          yes
TM       96.57%       0.50      25.13          yes
TMPW     93.36%       0.50      22.29          yes
TOM      90.00%       0.54      10.72          yes
TXN      93.93%       0.50      27.36          yes
TXU      95.96%       0.51      19.37          yes
USAI     90.20%       0.50      21.18          yes



Table 10. Continued

Stock    Hit Ratio    Chance    Z-Statistic    Significant
WEN      [illegible]  0.51      14.92          yes
WHR      [illegible]  0.50      16.74          yes
WIN      83.77%       [illegible]  [illegible]    yes
WMT      95.07%       0.50      36.39          yes
WPPGY    97.42%       0.51      17.51          yes
X        92.54%       0.53      18.45          yes
XEL      93.00%       0.50      13.78          yes
XOM      95.08%       0.50      34.19          yes
Table 11. Volume classification training set leave-one-out cross-validation

Stock    Hit Ratio    Chance    Z-Statistic    Significant
AAPL     84.79%       0.50      25.33          yes
ABS      91.71%       0.50      19.37          yes
ADSX     87.66%       0.56      9.88           yes
AEOS     80.29%       0.50      12.40          yes
AET      89.50%       0.51      20.70          yes
AFFX     90.91%       0.51      12.52          yes
ANF      86.23%       0.51      15.34          yes
AOL      94.51%       0.50      45.03          yes
ASKJ     83.84%       0.50      12.25          yes
AXP      89.96%       0.52      29.35          yes
BA       94.70%       0.50      44.58          yes
BAB      94.87%       0.50      28.54          yes
BC       86.32%       0.51      10.39          yes
BDK      80.41%       0.51      10.24          yes
BMY      91.37%       0.50      34.18          yes
BUD      81.78%       0.50      16.76          yes
C        91.81%       0.50      38.73          yes
CCU      90.24%       0.50      19.43          yes
CHIR     92.58%       0.51      17.67          yes
CMTN     78.18%       0.50      8.26           yes
COX      87.34%       0.50      24.54          yes
CREE     91.86%       0.50      12.44          yes
CSCO     92.61%       0.50      39.23          yes
CVC      85.08%       0.51      18.15          yes
DAL      93.28%       0.51      38.62          yes
DD       86.98%       0.50      24.67          yes
DELL     91.03%       0.50      33.33          yes
DOW      86.19%       0.51      20.12          yes
DRI      82.58%       0.53      9.90           yes
EBAY     87.46%       0.50      26.85          yes
ELY      85.01%       0.51      14.91          yes
ERICY    89.32%       0.50      28.47          yes
F        92.21%       0.50      44.12          yes
FDX      87.98%       0.50      24.37          yes
FPL      93.64%       0.51      13.25          yes
GD       91.51%       0.52      26.98          yes
GE       92.63%       0.50      41.03          yes
GLW      89.46%       0.51      26.64          yes
GM       92.89%       0.51      43.67          yes
GPS      84.39%       0.50      23.15          yes
GR       87.45%       0.50      17.30          yes
GTW      89.64%       0.51      24.67          yes
HDI      83.59%       0.51      11.90          yes
Table 11. Continued

Stock    Hit Ratio    Chance    Z-Statistic    Significant
HUM      87.68%       0.50      15.18          yes
IBM      93.09%       0.50      41.05          yes
ILXO     87.32%       0.51      10.54          yes
INTC     91.70%       0.50      40.01          yes
JNJ      91.21%       0.51      29.81          yes
KM       90.80%       0.50      32.39          yes
KR       78.24%       0.52      12.15          yes
LIZ      87.19%       0.50      11.57          yes
LLTC     83.84%       0.50      11.66          yes
LLY      89.91%       0.50      29.59          yes
LMT      92.68%       0.51      32.75          yes
LUV      90.07%       0.50      27.13          yes
MCD      90.75%       0.50      29.74          yes
MO       89.94%       0.51      29.86          yes
MON      90.44%       0.50      19.83          yes
MOT      91.60%       0.50      37.70          yes
MRK      91.98%       0.52      28.52          yes
MYG      85.63%       0.52      19.10          yes
NKE      81.60%       0.50      12.84          yes
NOK      93.13%       0.51      27.28          yes
NSANY    95.29%       0.50      18.67          yes
NVS      95.56%       0.50      24.39          yes
NXTL     88.44%       0.52      20.68          yes
ORCL     95.90%       0.50      31.36          yes
PBG      81.41%       0.50      10.99          yes
PD       95.22%       0.55      13.76          yes
PEP      87.27%       0.50      22.59          yes
PHA      91.53%       0.51      21.84          yes
RBK      79.23%       0.50      10.34          yes
RL       81.41%       0.52      8.38           yes
ROST     78.04%       0.54      7.07           yes
S        87.51%       0.50      23.73          yes
SNE      95.19%       0.50      28.46          yes
SO       91.54%       0.50      13.70          yes
SOI      93.61%       0.52      13.46          yes
TGT      83.42%       0.51      20.68          yes
TM       93.54%       0.50      23.50          yes
TMPW     89.44%       0.50      20.28          yes
TOM      83.18%       0.54      8.69           yes
TXN      90.64%       0.50      25.31          yes
TXU      95.11%       0.51      19.00          yes
USAI     86.31%       0.50      19.13          yes
Table 11. Continued

Stock    Hit Ratio    Chance    Z-Statistic    Significant
WEN      83.49%       0.51      13.16          yes
WHR      88.60%       0.50      15.77          yes
WIN      73.25%       0.50      6.90           yes
WMT      92.09%       0.50      33.97          yes
WPPGY    95.99%       0.51      16.97          yes
X        89.55%       0.53      17.06          yes
XEL      85.60%       0.50      11.41          yes
XOM      95.08%       0.50      34.19          yes
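
The Chance and Z-Statistic columns of the training set tables can be illustrated with a short sketch. It assumes the standard definitions from the discriminant analysis literature: for two groups the proportional chance criterion is p^2 + (1 - p)^2, where p is the share of documents in one group, and the z-statistic is a one-sample test of the hit ratio against that criterion. The function names and the document counts are hypothetical, since the tables do not report group sizes, and Python is used only for brevity.

```python
import math


def proportional_chance(n_group_a, n_group_b):
    """Proportional chance criterion for two groups: p^2 + (1 - p)^2."""
    p = n_group_a / (n_group_a + n_group_b)
    return p ** 2 + (1 - p) ** 2


def z_statistic(hit_ratio, chance, n_documents):
    """One-sample z-test of the hit ratio against the chance criterion."""
    standard_error = math.sqrt(chance * (1 - chance) / n_documents)
    return (hit_ratio - chance) / standard_error


# Hypothetical training set (not taken from the tables): 400 documents,
# 220 in one group and 180 in the other, 360 of them classified correctly.
chance = proportional_chance(220, 180)
z = z_statistic(360 / 400, chance, 400)
print(f"chance = {chance:.3f}, z = {z:.2f}")
```

Under these assumptions a large z-statistic arises whenever the hit ratio is far above the chance criterion or the training set is large, which would be consistent with every training set row being marked significant.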
Table 12. Return classification holdout sample results

Stock    Hit Ratio    Press's Q    Significant
AAPL     56.56%       3.81         yes
ABS      48.89%       0.07         no
ADSX     64.04%       8.98         yes
AEOS     63.01%       4.95         yes
AET      50.31%       0.01         no
AFFX     62.07%       1.69         no
ANF      58.33%       1.67         no
AOL      51.58%       0.57         no
ASKJ     37.29%       3.81         no
AXP      56.25%       5.75         yes
BA       49.57%       0.04         no
BAB      42.06%       5.88         no
BC       25.00%       14.00        no
BDK      57.47%       1.94         no
BMY      52.93%       1.52         no
BUD      43.86%       1.72         no
C        42.34%       11.65        no
CCU      51.38%       0.08         no
CHIR     44.23%       0.69         no
CMTN     40.00%       1.00         no
COX      42.11%       2.84         no
CREE     54.08%       0.65         no
CSCO     45.76%       2.80         no
CVC      47.67%       0.19         no
DAL      47.07%       1.46         no
DD       54.31%       1.72         no
DELL     43.91%       4.63         no
DOW      50.00%       0.00         no
DRI      51.95%       0.12         no
EBAY     44.13%       2.93         no
ELY      71.60%       15.12        yes
ERICY    53.45%       1.38         no
F        49.91%       0.00         no
FDX      53.06%       0.92         no
FPL      57.27%       2.33         no
GD       44.62%       2.90         no
GE       47.31%       1.51         no
GLW      55.84%       3.16         yes
GM       54.87%       5.64         yes
GPS      43.67%       2.53         no
GR       46.25%       0.45         no
GTW      45.05%       1.78         no
HDI      35.09%       5.07         no
Table 12. Continued

Stock    Hit Ratio    Press's Q    Significant
HUM      41.67%       1.00         no
IBM      45.38%       2.96         no
ILXO     76.47%       4.76         yes
INTC     48.39%       0.55         no
JNJ      51.45%       0.29         no
KM       48.09%       0.69         no
KR       52.20%       0.31         no
LIZ      37.50%       4.00         no
LLTC     25.58%       10.26        no
LLY      54.51%       1.98         no
LMT      50.13%       0.00         no
LUV      49.28%       0.04         no
MCD      59.35%       9.73         yes
MO       52.13%       0.72         no
MON      51.58%       0.19         no
MOT      45.65%       2.87         no
MRK      51.72%       0.45         no
MYG      51.76%       0.11         no
NKE      60.67%       4.06         yes
NOK      50.00%       0.00         no
NSANY    37.86%       8.26         no
NVS      54.95%       1.78         no
NXTL     48.63%       0.11         no
ORCL     51.33%       0.16         no
PBG      66.67%       7.00         yes
PD       41.79%       1.81         no
PEP      49.68%       0.01         no
PHA      50.24%       0.00         no
RBK      64.41%       4.90         yes
RL       68.42%       2.58         no
ROST     23.53%       4.76         no
S        39.22%       7.12         no
SNE      55.78%       2.66         no
SO       62.22%       5.38         yes
SOI      46.43%       0.14         no
TGT      45.65%       1.04         no
TM       [illegible]  11.34        yes
TMPW     48.33%       0.07         no
TOM      68.75%       2.25         no
TXN      57.96%       5.73         yes
TXU      48.33%       0.07         no
USAI     51.85%       0.15         no



Table 12. Continued

Stock    Hit Ratio    Press's Q    Significant
WEN      55.56%       1.11         no
WHR      40.70%       2.98         no
WIN      39.71%       2.88         no
WMT      51.41%       0.25         no
WPPGY    46.38%       0.36         no
X        37.25%       3.31         no
XEL      65.65%       12.83        yes
XOM      51.99%       0.44         no
Table 13. Volume classification holdout sample results

Stock    Hit Ratio    Press's Q    Significant
AAPL     54.38%       1.66         no
ABS      45.26%       1.23         no
ADSX     55.17%       1.24         no
AEOS     66.22%       7.78         yes
AET      54.60%       1.38         no
AFFX     50.00%       0.00         no
ANF      53.73%       0.37         no
AOL      49.11%       0.18         no
ASKJ     45.00%       0.60         no
AXP      45.88%       2.47         no
BA       45.03%       5.47         no
BAB      57.02%       4.63         yes
BC       46.55%       0.28         no
BDK      55.81%       1.16         no
BMY      46.44%       2.21         no
BUD      58.72%       3.31         yes
C        52.02%       0.81         no
CCU      51.75%       0.14         no
CHIR     48.15%       0.07         no
CMTN     20.00%       9.00         no
COX      59.82%       4.32         yes
CREE     44.33%       1.25         no
CSCO     48.68%       0.26         no
CVC      50.59%       0.01         no
DAL      53.30%       1.85         no
DD       56.47%       3.88         yes
DELL     47.13%       1.03         no
DOW      64.49%       8.98         yes
DRI      52.70%       0.22         no
EBAY     56.67%       3.73         yes
ELY      50.62%       0.01         no
ERICY    52.54%       0.71         no
F        51.05%       0.23         no
FDX      56.28%       3.89         yes
FPL      48.21%       0.14         no
GD       56.22%       3.86         yes
GE       51.90%       0.72         no
GLW      47.56%       0.59         no
GM       52.32%       1.25         no
GPS      50.00%       0.00         no
GR       59.26%       2.78         yes
GTW      45.99%       1.20         no
HDI      43.86%       0.86         no
Table 13. Continued

Stock    Hit Ratio    Press's Q    Significant
HUM      47.22%       0.11         no
IBM      50.63%       0.05         no
ILXO     64.71%       1.47         no
INTC     48.15%       0.70         no
JNJ      52.65%       0.95         no
KM       48.72%       0.31         no
KR       44.59%       1.84         no
LIZ      43.75%       1.00         no
LLTC     36.96%       3.13         no
LLY      44.31%       3.19         no
LMT      46.92%       1.48         no
LUV      55.77%       2.77         yes
MCD      49.65%       0.01         no
MO       54.25%       2.89         yes
MON      50.27%       0.01         no
MOT      48.66%       0.27         no
MRK      56.49%       6.23         yes
MYG      40.48%       3.05         no
NKE      42.35%       1.99         no
NOK      47.55%       0.49         no
NSANY    44.00%       1.80         no
NVS      54.29%       1.29         no
NXTL     54.11%       0.99         no
ORCL     50.44%       0.02         no
PBG      43.86%       0.86         no
PD       39.39%       2.97         no
PEP      46.48%       0.70         no
PHA      51.76%       0.25         no
RBK      50.85%       0.02         no
RL       31.25%       2.25         no
ROST     70.59%       2.88         yes
S        51.37%       0.11         no
SNE      61.31%       10.18        yes
SO       51.19%       0.05         no
SOI      34.38%       3.13         no
TGT      46.27%       0.75         no
TM       47.47%       0.41         no
TMPW     44.64%       0.64         no
TOM      37.50%       1.00         no
TXN      53.30%       0.99         no
TXU      36.51%       4.59         no
USAI     45.95%       0.73         no
Table 13. Continued

Stock    Hit Ratio    Press's Q    Significant
WEN      56.18%       1.36         no
WHR      49.44%       0.01         no
WIN      38.57%       3.66         no
WMT      48.59%       0.25         no
WPPGY    56.34%       1.14         no
X        34.38%       6.25         no
XEL      54.89%       1.27         no
XOM      59.36%       9.93         yes
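
The Press's Q column of the holdout tables can be illustrated in the same way. The sketch below assumes the usual definition Q = (N - nK)^2 / (N(K - 1)), with N documents, n correct classifications and K groups. The significance rule is an assumption inferred from the pattern of yes and no entries, not something stated in the tables: Q must exceed 2.706, the chi-square critical value for one degree of freedom at the 0.10 level, and the hit ratio must be above 50%. The function names are hypothetical, and the counts 125 of 221 are not reported in the tables; they are chosen because they reproduce the 56.56% and 3.81 shown for AAPL in Table 12.

```python
def press_q(n_documents, n_correct, n_groups=2):
    """Press's Q = (N - n*K)^2 / (N * (K - 1))."""
    return (n_documents - n_correct * n_groups) ** 2 / (n_documents * (n_groups - 1))


def is_significant(q, hit_ratio, critical_value=2.706):
    """Assumed rule: Q above the critical value and hit ratio above 50%."""
    return q > critical_value and hit_ratio > 0.5


# Hypothetical holdout sample: 125 of 221 documents classified correctly.
n_documents, n_correct = 221, 125
q = press_q(n_documents, n_correct)
print(f"hit ratio = {n_correct / n_documents:.2%}, Q = {q:.2f}")
# prints "hit ratio = 56.56%, Q = 3.81"
```

Because Q is symmetric around a 50% hit ratio, a classifier that is wrong far more often than chance also produces a large Q. This may explain rows such as BC in Table 12, where Q is 14.00 but the 25.00% hit ratio is marked not significant.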
Table 14. Return classification with x1, training set results

Stock    Step Entered    Hit Ratio    Chance    Z-Statistic    Significant
AAPL     30              93.88%       0.50      32.13          yes
ABS      94              95.27%       0.50      20.81          yes
ADSX     25              89.96%       0.51      11.82          yes
AEOS     2               87.98%       0.50      15.37          yes
AET      38              93.98%       0.51      22.73          yes
AFFX     45              97.08%       0.52      14.12          yes
ANF      27              88.09%       0.54      14.80          yes
AOL      11              97.16%       0.51      47.03          yes
ASKJ     26              91.38%       0.51      14.61          yes
AXP      6               95.92%       0.51      34.98          yes
BA       619             96.81%       0.50      46.78          yes
BAB      77              96.63%       0.50      29.59          yes
BC       25              91.26%       0.51      11.50          yes
BDK      1               84.69%       0.50      11.78          yes
BMY      1               95.24%       0.50      37.74          yes
BUD      1               85.74%       0.50      18.71          yes
C        10              96.01%       0.50      42.79          yes
CHIR     65              93.68%       0.50      18.27          yes
CMTN     13              87.44%       0.50      10.98          yes
COX      20              91.13%       0.50      27.00          yes
CVC      44              89.67%       0.50      20.79          yes
DAL      2               96.95%       0.50      43.00          yes
DD       39              95.61%       0.51      29.37          yes
DELL     108             94.19%       0.50      36.22          yes
DOW      3               94.10%       0.50      24.81          yes
DRI      52              92.53%       0.57      11.95          yes
EBAY     1               96.52%       0.50      33.05          yes
ELY      77              93.11%       0.50      18.26          yes
ERICY    1               94.53%       0.51      31.43          yes
F        5               95.55%       0.50      47.46          yes
FDX      36              92.08%       0.51      25.90          yes
GD       4               94.05%       0.50      29.56          yes
GE       295             93.40%       0.50      42.04          yes
GLW      16              94.48%       0.51      30.01          yes
GM       113             95.62%       0.50      47.34          yes
GPS      18              92.12%       0.52      26.88          yes
GR       7               92.59%       0.50      19.49          yes
GTW      95              92.99%       0.51      27.23          yes
HDI      4               89.14%       0.50      13.75          yes
HUM      15              94.40%       0.53      16.43          yes
IBM      3               95.47%       0.51      43.49          yes
INTC     5               95.47%       0.51      43.10          yes
JNJ      1               95.89%       0.50      33.55          yes
Table 14. Continued


Stock

Step Entered

Hit Ratio

Chance

Z-Statistic 

Significant 

1 

no  /COO/ 

A  C  A 

0.50 

33.93 

yes 

1  A 

14 

oo.54% 

A  C  A 

0.50 

16.76 

yes 

LIZ, 

4U 

HA  AO/ 

90.24% 

0.52 

12.13 

yes 

1 
1 

A  C  1 

14.31 

yes 

r  T  V 

J  1 

rvc  /CIO/ 

y5.oi% 

0.50 

33.31 

yes 

T  'KAT 

1  AO 

loo 

yo.23  % 

A  C  A 

0.50 

35.92 

yes 

bU  V 

1 
1 

y4.oj  % 

A  C  1 

0.51 

29.69 

yes 

ML,U 

4 

y3.27% 

A   C  C\ 

0.50 

32.11 

yes 

0  1  A 

ZIO 

H/l   1  TO/ 

y4.  liTo 

A  C  1 

33.29 

yes 

MUJN 

lU 

n*?  coo/ 
y3.5o  % 

A  C  1 

0.51 

20.89 

yes 

\AC\'T 

1  "1 

CiA  C/^O/ 

y4.jo% 

A  C  1 

0.51 

39.62 

yes 

\A'DV 

MKJv 

1 

1 

n/r  AT  0/ 

yo. U3% 

A  C  f\ 

0.50 

32.57 

yes 

M  YLr 

jI 

O  /I   yl  'J  0/ 

o4.43% 

A  C  A 

0.50 

1  /A  in 

19.57 

yes 

1 

1 

Ai  cno/ 

yi.57% 

A  C  A 

0.50 

16.93 

yes 

IN  UK. 

A/l  CTO/ 

y4.o7% 

A  C(\ 

0.50 

29.01 

yes 

KTC  A  XTV 

1 
1 

yo.7o% 

A  C  1 

0.51 

A  r\  1  o 

19.18 

yes 

IN 

A  1 

AT  TOO/ 

y /.7o% 

A  C  A 

0.50 

25.67 

yes 

MVTT 

IN  A 1 L 

1  0 

io 

A1  y1<C0/ 

yi.4o% 

A  c  r\ 

0.50 

23.89 

yes 

rrJLr 

"J  "J 

OO  0/^0/ 
OZ.OO/o 

A  C  1 

0.51 

1 1.33 

yes 

rU 

Z4 

AT  "^00/ 

y /.2o% 

A  C  £1 

0.56 

14.25 

yes 

DThD 
rtLr 

1  T 
1/ 

AA  /CTO/ 

yu.o/% 

A 

0.52 

23.85 

yes 

rrlA 

1 

lo 

AC  /I/CO/ 

y5.oo% 

A  C  A 

0.50 

24.67 

yes 

D  D  V 

y 

oo.oU% 

0.51 

12.93 

yes 

1  A 

O/C   1  1  0/ 

00.1  1% 

0.52 

9.90 

yes 

c 

c  c 

D  J 

Ai  nno/ 

yi.77% 

0.55 

23.55 

yes 

Z 

AT  T^O/ 

y7.22% 

0.51 

29.50 

yes 

1 A 

AO  000/ 

y2.2o  % 

A  CO 

0.53 

25.18 

yes 

T\A 

00 
00 

AT  CylO/ 

y /.54% 

A  c  r\ 

0.50 

25.72 

yes 

z 

AO  A  AO/ 

yz.uu/o 

A  C  1 

0.51 

12.44 

yes 

1  AiN 

AC  OAO/ 

yj.oUyo 

A  C  A 

0.50 

28.54 

yes 

TYT  T 
1 AU 

1  Q 

ly 

A/:  OOO/ 

yo.00% 

A  C  1 

0.51 

20.00 

yes 

r  TC  A  T 

1 

i 

AI  A-OO/ 

y3.y2% 

0.50 

23.18 

yes 

WHIN 

-54 

Ar\  ^  cc\/ 

yo.o5% 

0.53 

15.76 

yes 

WHR 

1 1 

1  1 

/O 

U.  jU 

1  Q  AA 

ly.uu 

yes 

WIN 

23 

89.91% 

0.51 

11.76 

yes 

WMT 

1 

97.07% 

0.50 

38.19 

yes 

X 

72 

89.62% 

0.55 

16.52 

yes 

XEL 

48 

90.97% 

0.50 

13.61 

yes 

XOM 

3 

95.45% 

0.50 

34.87 

yes 
Table 15. Volume classification with x1, training set results

Stock    Step x1 Entered    Hit Ratio    Chance    Z-Statistic    Significant
AAPL     1                  93.60%       0.50      31.55          yes
ABS      1                  98.28%       0.50      22.05          yes
ADSX                        90.27%       0.55      10.71          yes
AEOS     1                  87.47%       0.50      15.23          yes
AET      1                  96.73%       0.51      24.37          yes
AFFX                        98.31%       0.51      14.58          yes
ANF      1                  91.67%       0.51      17.63          yes
AOL      1                  97.57%       0.50      47.65          yes
ASKJ     1                  95.99%       0.50      16.54          yes
AXP      1                  96.51%       0.52      34.37          yes
BA       10                 97.70%       0.50      47.11          yes
BAB      1                  97.69%       0.50      30.05          yes
BC                          94.61%       0.50      12.65          yes
BDK      1                  91.16%       0.51      13.91          yes
BMY      1                  96.75%       0.50      38.44          yes
BUD      1                  88.00%       0.50      19.72          yes
C        1                  96.75%       0.50      42.87          yes
CCU      1                  92.88%       0.50      20.22          yes
CHIR     1                  96.15%       0.51      19.12          yes
CMTN     1                  90.23%       0.50      11.74          yes
COX                         94.47%       0.50      29.00          yes
CSCO     1                  96.43%       0.50      42.25          yes
CVC      1                  89.41%       0.51      20.25          yes
DAL      1                  97.66%       0.51      42.36          yes
DD       1                  93.82%       0.50      28.82          yes
DELL     1                  97.34%       0.50      38.08          yes
DOW      1                  91.28%       0.52      22.22          yes
DRI                         91.81%       0.54      12.67          yes
EBAY     1                  95.45%       0.50      32.16          yes
ELY      1                  90.34%       0.51      16.62          yes
ERICY    1                  94.84%       0.50      32.07          yes
F        1                  96.72%       0.50      48.38          yes
FDX      1                  94.17%       0.50      27.85          yes
FPL                         95.32%       0.51      13.72          yes
GD       1                  96.16%       0.52      29.89          yes
GE       1                  96.09%       0.50      44.18          yes
GLW      1                  94.59%       0.51      30.11          yes
GM                          97.34%       0.51      47.84          yes
GPS                         92.51%       0.50      28.46          yes
GR       20                 91.94%       0.50      19.14          yes
GTW                         96.09%       0.51      28.68          yes
HDI                         92.88%       0.50      14.99          yes
HUM      3                  91.84%       0.50      16.56          yes
Table 15. Continued


Stock

Step x1 Entered

Hit Ratio

Chance

Z-Statistic

Significant 

1 

1 

OA  070/ 
fu.y  /  /o 

0  s  1 
U.J  1 

A  A  11 

44.  J 1 

yes 

o 

01  070/, 
yj.tj  /  /o 

A  SA 
U.jU 

1  O  1  A 

iz.lU 

yes 

1 

1 

Q«  870/ 
i/J.oZ  /o 

A  so 
U.jU 

A1  T> 

4j. ZZ 

yes 

TNT 

1 

1 

yo.  It  /o 

A  SI 
U.J  1 

11  1  A 

Jj.  lo 

yes 

1  <: 

1 J 

04  4^;<'^ 

yi.f  0  /o 

A  SA 

U.jU 

1A  OA 

j4.yo 

yes 

IVIV 

1 

1 

87  7/^0/ 
0  / .  /  0  /o 

0  <7 

U.  jz 

10. JO 

yes 

T  T7 

OA  AOO/ 

yu.oo  /O 

A  ^A 

U.jU 

1  0  CA 

Iz.jU 

yes 

T  T  TP 

Q 

00  1  40/^ 

A  SA 
U.jU 

1  1  lA. 

1  J.  /o 

yes 

I  T  Y 

1 

1 

0"?  0^0/ 

yj.uo  /o 

A  ^A 
U.jU 

Ji.JJ 

yes 

T  MT 

1 

i 

OA  A«o/ 

yo.uj  /o 

A  ^  1 
U.J  1 

JJ.41 

yes 

T  TIV 

1 

1 

04  <\70/. 

A  SA 

U.jU 

zy.oo 

yes 

fi 
o 

04  100/, 
/o 

A  SA 
U.jU 

10  1  A 

JZ.  14 

yes 

MO 

1 

1 

OA  040/ 

A  ^  1 
U.J  1 

Jj.y4 

yes 

MON 

1 

1 

OA  410/ 

yo.'tj  /o 

A  ^A 

U.jU 

llAy 

yes 

MOT 

1 

1 

OA  1  70/ 
yo.  IZ  /O 

A  SA 
U.jU 

A  1  O/C 

41. zo 

yes 

MRK 

1 

1 

OA  770/, 
7O.  /  /  /o 

A  S7 
U.  jZ 

11  /CC 

jI.dj 

yes 

1 

1 

01  ^70/ 
y  i .  J  /  /o 

A  ^7 

U.  JZ 

zz.4o 

yes 

1 

1 

01  700/. 
yj .  /  u  /o 

A  ^0 
U.jU 

1  /.  /o 

yes 

J 

QC  A'sO/, 

yj.oj  /O 

A  ^  1 
U.J  1 

0  0  OA 

zo.yU 

yes 

1 

1 

07  880/ 

y  /  .00  /o 

A  ^A 

U.jU 

1  n  '7'? 
19.73 

yes 

1 

1 

08  OAO/ 

yo.uo  /o 

A  <A 

U.jU 

25.73 

yes 

NXTT 

01  7  AO/ 

y  1 .  /  0  /o 

A  <7 

U.  JZ 

yes 

1 

1 

08  700/ 

yo./yyo 

U.jU 

33.00 

yes 

PBG 

1 

1 

81  070/, 
oj.y  /  /o 

0  SA 
U.jU 

1  1  OA 

1  l.yU 

yes 

PD 

07  970/, 
y  /  .z  /  /o 

A  SS 

U.J  J 

1  A  A(^ 

14.40 

yes 

PEP 

1 

1 

OS  A^o/ 
yj.oj  /o 

A  SA 
U.jU 

11 .0  1 

yes 

PHA 

1 

1 

OS  400/, 
yj.ty  /o 

A  SI 
U.J  1 

zj.yy 

yes 

RBK 

■3 
-J 

88  sno/, 

00.  JU  /O 

A  <A 
U.jU 

13. oz 

yes 

ROST 

1 

z. 

8S  080/. 
0  j.yo  /o 

0 

U.  j4 

A  yl  1 

9.41 

yes 

1 

1 

04  010/ 

y4.yi  /o 

A  CA 

U.jU 

28.40 

yes 

SNF 

1 

1 

08  OAO/ 

yo.uu/o 

A  zr\ 
U.jU 

1A 

30.23 

yes 

SO 

1 

1 

OS  770/ 

yj.zz/o 

A  CA 

U.jU 

14.92 

yes 

SOT 

JO 

OA  7/10/ 

U.jz 

14.32 

yes 

TGT 

1 
1 

01  110/ 

y 1 . 1 1  To 

0.51 

25.55 

yes 

TA/f 

1  IVl 

1 

O/^  C70/ 

0.50 

25.13 

yes 

TMPW 

1  IVlx  VV 

1 
1 

OA  010/ 

yo.ojyo 

0.50 

24.08 

yes 

TOM 

9 

92.73% 

0  54 

1 1  SI 
1 1  .J  J 

yes 

TXN 

1 

94.44% 

0.50 

27.68 

yes 

TXU 

4 

94.68% 

0.51 

18.81 

yes 

USAI 

1 

93.80% 

0.50 

23.08 

yes 
yes 

WEN 

1 

93.30% 

0.51 

17.17 

WHR 

21 

91.92% 

0.50 

17.13 

yes 
Table 15. Continued

Stock    Step x1 Entered    Hit Ratio    Chance    Z-Statistic    Significant
WIN      1                  [illegible]  0.50      11.41          yes
WMT      1                  97.93%       0.50      38.71          yes
WPPGY    1                  98.57%       0.51      17.94          yes
X        1                  94.03%       0.53      19.14          yes
XEL      4                  94.94%       0.50      14.41          yes
XOM      1                  95.98%       0.50      34.87          yes
Table 16. Return classification with x1, holdout sample results

Stock    Hit Ratio    Press's Q    Significant
AAPL     57.01%       4.35         yes
ABS      54.81%       1.25         no
ADSX     56.14%       1.72         no
AEOS     49.32%       0.01         no
AET      49.07%       0.06         no
AFFX     41.38%       0.86         no
ANF      61.67%       3.27         yes
AOL      52.98%       2.03         no
ASKJ     44.07%       0.83         no
AXP      51.90%       0.53         no
BA       50.43%       0.04         no
BAB      47.21%       0.73         no
BC       35.71%       4.57         no
BDK      50.57%       0.01         no
BMY      46.40%       2.31         no
BUD      45.61%       0.88         no
C        46.77%       2.06         no
CHIR     61.54%       2.77         yes
CMTN     40.00%       1.00         no
COX      35.09%       10.14        no
CVC      55.81%       1.16         no
DAL      49.65%       0.02         no
DD       55.17%       2.48         no
DELL     41.03%       10.05        no
DOW      50.00%       0.00         no
DRI      57.14%       1.57         no
EBAY     46.95%       0.79         no
ELY      60.49%       3.57         yes
ERICY    50.69%       0.06         no
F        56.72%       9.81         yes
FDX      49.80%       0.00         no
GD       47.81%       0.48         no
GE       42.31%       12.31        no
GLW      51.52%       0.21         no
GM       55.87%       8.22         yes
GPS      38.61%       8.20         no
GR       37.50%       5.00         no
GTW      48.35%       0.20         no
HDI      50.88%       0.02         no
HUM      41.67%       1.00         no
IBM      47.40%       0.94         no
INTC     51.04%       0.23         no
JNJ      52.03%       0.57         no



Table  16.  Continued 


Stock

Hit Ratio

Press's Q

Significant 

NJVl 

A/Z  000/ 

1  r\  1 

1.91 

no 

IVR. 

Cf\  0/10/ 

0.06 

no 

f  T7 

A1  "7^0/ 

1    A  A 

1.00 

no 

T  T  TP 

in  010/ 
J  /.ZIto 

2.81 

no 

T  T  V 

AO  TIO/ 

A  1 

0.15 

no 

T  MT 
LlVl  1 

AC\  KOO/ 

4y.ozyo 

A  A'^ 

0.02 

no 

I  Tr\/ 

LU  V 

/I '3  C/IO/ 

43.j4% 

3.49 

no 

IVTPD 

CA  /COO/ 

2.43 

no 

4y.ozyo 

A  A'^ 

0.02 

no 

C5  /ZOO/ 

1.03 

no 

AC  "5  00/ 

3.23 

no 

CI   1  00/ 

J  1. 1 y To 

A  1 

0.21 

no 

IVl  I  o 

/I  0  O  yl  0/ 

A  1  1 

0.11 

no 

In  JVC 

jU.i47o 

13.76 

no 

jU.UU% 

A  AA 

0.00 

no 

i^lO/MN  I 

/IT  1  /lO/ 

4/.  1470 

0.46 

no 

IN  V  o 

f.A  000/ 
04.zy% 

14.86 

yes 

\rVTT 

CA  TOO/ 

j4.  /y% 

1.34 

no 

PRO 

zro  OCO/ 

oo.zjyo 

O  Af\ 

8.40 

yes 

1  c  ooo/ 

5.39 

no 

PFP 

/17  770/ 
4  / .  /  /  /o 

A  '2  1 

no 

PHA 

/lO  000/ 

4U.UU70 

O  A 

8.20 

no 

IVDXV 

jy.  jz  /o 

7  OC 

Z.05 

no 

^8  00/ 
Jo.oZ/o 

o  c  o 

0.53 

no 

^1  ATO/ 
J  1  .Oj  /o 

O  1  /C 

U.16 

no 

z:  1  0  10/ 

ol.ol% 

11.10 

yes 

TdT 

070/ 

1     /<  ^ 

1.42 

no 

TIU 

CO  700/ 

JO.Zo/o 

A  An 

yes 

1  VJiVl 

000/ 

75.00% 

4.00 

yes 

1  /^Ly 

C7  coo/ 

5.12 

yes 

CO  000/ 

A  AA 

0.00 

no 

CI    O  CO/ 

51.85% 

0.15 

no 

WFN 

VV  J_»i>J 

f.(\  000/ 
OU.  UU% 

3.60 

yes 

WHR 

34  88% 

7  86 

no 

WIN 

27.94% 

13.24 

no 

WMT 

45.14% 

3.01 

no 

X 

47.06% 

0.18 

no 

XEL 

64.89% 

11.61 

yes 

XOM 

55.23% 

3.04 

yes 
Table 17. Volume classification with x1, holdout sample results

Stock    Hit Ratio    Press's Q    Significant
AAPL     58.99%       7.01         yes
ABS      58.21%       3.61         yes
ADSX     62.28%       6.88         yes
AEOS     46.58%       0.34         no
AET      49.33%       0.03         no
AFFX     68.97%       4.17         yes
ANF      56.67%       1.07         no
AOL      53.15%       2.21         no
ASKJ     47.46%       0.15         no
AXP      46.24%       2.03         no
BA       50.74%       0.12         no
BAB      57.21%       4.76         yes
BC       73.21%       12.07        yes
BDK      57.65%       1.99         no
BMY      50.23%       0.01         no
BUD      62.26%       6.38         yes
C        48.02%       0.75         no
CCU      54.63%       0.93         no
CHIR     52.94%       0.18         no
CMTN     52.00%       0.04         no
COX      64.55%       9.31         yes
CSCO     43.73%       5.89         no
CVC      59.04%       2.71         yes
DAL      52.35%       0.89         no
DD       52.38%       0.52         no
DELL     58.31%       8.47         yes
DOW      63.54%       7.04         yes
DRI      48.65%       0.05         no
EBAY     60.58%       9.31         yes
ELY      55.56%       1.00         no
ERICY    55.15%       2.88         yes
F        56.45%       8.51         yes
FDX      54.73%       2.18         no
FPL      41.82%       2.95         no
GD       53.82%       1.45         no
GE       45.25%       4.46         no
GLW      45.45%       1.91         no
GM       44.35%       7.35         no
GPS      56.96%       3.06         yes
GR       57.50%       1.80         no
GTW      60.44%       7.93         yes
HDI      52.63%       0.16         no
HUM      36.11%       2.78         no
Table 17. Continued


Stock

Hit Ratio

Press's  Q 

Significant 

TRIU 

LDiVl 

T  1  0/ 
jU.J  1  /o 

U.Ul 

no 

/II  1  00/ 

41.1070 

A  C  O 

0.53 

no 

UN  1 

J  J.  /j/o 

o  oc 
z.o5 

yes 

J  IN  J 

C/l    1  QO/ 

o  o  c 

1.55 

no 

IVIVI 

/IC  710/ 

48.  / 1  70 

A  O  1 

no 

J  /  .yoTo 

0  no 
3.yo 

yes 

T  T7 

/I  ^  1  AO/ 

43.  lOTo 

0.58 

no 

T  T  TC 

1  QO/ 

44.  lyyo 

A  C  O 

0.58 

no 

r  T  V 

12.55 

no 

Li  VI 1 

^4  OQO/ 

j4.zy  /o 

O  OT 

Z.o3 

yes 

r  TTV 

C7  0/10/ 
J  /  .8470 

C  AO 

5.0z 

yes 

4  /  .407o 

A  7 1 

0.71 

no 

MO 

J  1 .  j47o 

A  17 

U.3  / 

no 

J0.U470 

2.66 

no 

cn  C70/ 
jy.j  /  /o 

13.59 

yes 

iVlJviV 

OZ. 41/70 

OO  CA 

ZZ.56 

yes 

ivfvn 

IVl  I  VJ 

A1  110/ 

4 J. J  /yo 

1  AC 

1.46 

no 

coo/ 

3.05 

yes 

IN  wrv. 

/I  A  770/ 

40.  /  /  TO 

A  O  y1 

U.o4 

no 

M<J  AMV 
i>0/vlN  I 

c/r  A70/ 
jO.O  I/O 

O  1  0 

2.13 

no 

i>  V  lJ 

CQ  1  -70/ 

jy.  1  /yo 

5.69 

yes 

AO  r>70/ 
OZ.U  /yo 

8.45 

yes 

!)3.z4/o 

0.91 

no 

PRG 

OJ.  1 0  /o 

1  QC 

j.y5 

yes 

pr> 
ru 

DJ.85% 

0.38 

no 

PFP 

jO.  j4  /o 

O  00 

Z.Z8 

no 

PT-T  A 
rn/\ 

CO  7A0/ 

j8.  /Oto 

C  {\C 

5.96 

yes 

IVDJS. 

CI  7O0/ 
J  1 .  /Z70 

0.07 

no 

IvvJij  1 

7A  /I  70/ 

/0.4/yo 

4.76 

yes 

o 

CO  AOO/ 

JZ.Uoyo 

0.25 

no 

CO  0  00/ 

59.30% 

6.75 

yes 

AO  7O0/ 

4o.7o% 

0.05 

no 

/in  nno/ 
4U.UUyo 

1  f\(\ 

1.00 

no 

1  Lr  1 

o r\  o cn/ 

39.85% 

5.48 

no 

1 M 

50.97% 

0.06 

no 

1  Mr  W 

42.86% 

1.14 

no 

TOM 

31  25% 

no 

TXN 

60.71% 

10.29 

yes 

TXU 

46.67% 

0.27 

no 

USAI 

49.04% 

0.04 

no 

WEN 

58.62% 

2.59 

no 

WHR 

50.00% 

0.00 

no 
Table 17. Continued

Stock    Hit Ratio    Press's Q    Significant
WIN      [illegible]  [illegible]  no
WMT      63.75%       23.38        yes
WPPGY    39.13%       3.26         no
X        33.33%       5.67         no
XEL      62.31%       7.88         yes
XOM      51.14%       0.14         no

CHAPTER  6 

DISCUSSION,  CONCLUSIONS,  AND  FUTURE  DIRECTIONS 
Chapter 6 summarizes the findings of this dissertation in light of the proposed
research questions, draws conclusions and provides directions for future research.
Section 6.1 contains the summary. Section 6.2 discusses the results. Section 6.3
provides a conclusion. Finally, Section 6.4 discusses directions for future
research.

6.1  Summary 

The purpose of this dissertation is to provide a methodology that an organization can
use to aid in environmental scanning of web documents. The proposed methodology
combines the vector space model representation of the documents with linear
discriminant analysis as the method of classifying the documents. The vector space
model is used to represent documents in the text classification literature, and linear
discriminant analysis is a well-founded method of classification. Linear discriminant
analysis provides a linear discriminant function found via a training set of documents
with known classification. The linear discriminant function can then be used to classify
future documents that appear on the web, giving an organization an idea of what new
documents are indicating. The process of collecting, representing and classifying the
training set and the new documents is automated via Java programs.
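
As a rough illustration of how the two pieces fit together, the sketch below represents a few made-up documents as term-frequency vectors and fits Fisher's two-group linear discriminant function to them. It is only a minimal sketch under stated assumptions: the two-term vocabulary, the training sentences, the equal-prior midpoint cutoff and all function names are hypothetical, it is written in Python rather than the Java used for the dissertation's programs, and it omits the stepwise variable selection used in the study.

```python
from collections import Counter

VOCABULARY = ["profit", "loss"]  # hypothetical two-term vocabulary


def term_vector(text, vocabulary=VOCABULARY):
    """Vector space model: one term-frequency coordinate per vocabulary term."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def mean(rows):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows, centre):
    """2x2 sum of outer products of deviations from the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for row in rows:
        d = [row[0] - centre[0], row[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_ldf(group0, group1):
    """Fisher's two-group rule: w = Sw^-1 (m1 - m0), cutoff at the midpoint."""
    m0, m1 = mean(group0), mean(group1)
    s0, s1 = scatter(group0, m0), scatter(group1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumes Sw is invertible
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    cutoff = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, cutoff


def classify(text, w, cutoff):
    """Return 1 if the discriminant score exceeds the cutoff, else 0."""
    x = term_vector(text)
    return 1 if w[0] * x[0] + w[1] * x[1] > cutoff else 0


# Made-up training documents: group 1 = "up" days, group 0 = "down" days.
up_docs = ["profit profit loss", "profit profit profit",
           "profit profit profit loss", "profit profit"]
down_docs = ["loss loss", "profit loss loss loss",
             "loss loss loss", "profit loss loss"]

w, cutoff = fit_ldf([term_vector(d) for d in down_docs],
                    [term_vector(d) for d in up_docs])
print(w, cutoff)
print(classify("profit rises as loss narrows and profit grows", w, cutoff))
```

With real news documents the vocabulary would contain thousands of terms, so the within-group scatter matrix would need a general matrix inverse rather than the 2x2 formula used here.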

The methodology developed in this dissertation is tested empirically on news
documents that appear at designated web sites about a set of 186 publicly traded
companies. After excluding companies with too few documents to analyze,
incomplete document collections or discontinued public trading, a set of 93 stocks is
used in this empirical study. The study is conducted in light of four research questions
outlined in Section 3.2 and Section 4.2. The results are organized according to these four
questions.

The first research question addresses how well the linear discriminant function
classifies the training set of documents. To answer this question, the validity of the
80% training set is checked using the z-statistic comparing the hit ratio to the
proportional chance criterion for both the training sample classification matrix and
the leave-one-out classification matrix. The proportional chance criterion value (chance)
is given for each stock with classification based on return in Table 8. Additionally,
Table 8 provides the hit ratio and z-statistic based on the training sample classification
matrix. Table 9 provides the same information for the leave-one-out classification
matrix. In both tables, the z-statistic comparing chance classification to the
classification matrices is significant at 10% for every stock. Table 10 provides the
proportional chance criterion value (chance), the hit ratio and z-statistic for each stock
with classification based on volume. Table 11 provides the same information for the
leave-one-out classification matrix. Once again, in both tables the z-statistic is
significant at 10% for every stock. The average hit ratio for return classification is
92.11% for the training classification matrix and 89.06% for the leave-one-out
classification matrix. The average hit ratio for volume classification is 91.68% for the
training classification matrix and 88.68% for the leave-one-out classification matrix.
Based on the evidence provided, the linear discriminant function does very well in
classifying the training set of documents.
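
The comparison of a hit ratio with the proportional chance criterion can be sketched as follows. This is a minimal sketch that assumes the usual two-group form of the criterion, p^2 + (1 - p)^2, and a one-sample z-test of a proportion; the exact formula used in the study is not restated in this chapter, and the group sizes, hit ratio and function names below are hypothetical.

```python
import math


def proportional_chance(n_group1, n_group2):
    """Proportional chance criterion p^2 + (1 - p)^2 for two groups."""
    p = n_group1 / (n_group1 + n_group2)
    return p * p + (1.0 - p) * (1.0 - p)


def hit_ratio_z(hit_ratio, chance, n_docs):
    """z-statistic for an observed hit ratio against the chance proportion."""
    std_error = math.sqrt(chance * (1.0 - chance) / n_docs)
    return (hit_ratio - chance) / std_error


# Hypothetical training set: 70 documents in one group, 50 in the other.
chance = proportional_chance(70, 50)
z = hit_ratio_z(0.92, chance, 120)
Z_CRITICAL_10PCT = 1.2816  # one-sided 10% point of the standard normal
print(f"chance = {chance:.4f}, z = {z:.2f}, significant = {z > Z_CRITICAL_10PCT}")
```

Whether the 10% level is applied one-sided or two-sided is an assumption here; with hit ratios near 90% the conclusion is the same either way.
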

The second research question addresses how well the linear discriminant function
derived from the training set of documents does on classifying the 20% holdout sample.
Specifically, does the linear discriminant function predict the correct classification in the
holdout sample better than random guessing? To address this question, each document in
the holdout sample is classified according to the linear discriminant function, and Press's
Q is computed from the number of correctly classified documents, the total number
of documents and the number of classification groups. Press's Q is compared against a
chi-square distribution with one degree of freedom for two-group classification. Using
return as the classification mechanism, 16 out of 93, or 17.20%, of the stocks had holdout
classification matrices that were statistically significant at 10%. Using volume as the
classification mechanism, the result is exactly the same. Therefore, the methodology does
a fairly good job of classifying the holdout sample for both classification mechanisms.
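
The Press's Q calculation can be sketched as below, assuming the standard form Q = (N - nK)^2 / (N(K - 1)), where N is the total number of documents, n the number correctly classified and K the number of groups. The counts are hypothetical, chosen so that the hit ratio is about 63.75%; the function names are illustrative only.

```python
import math


def press_q(n_total, n_correct, n_groups=2):
    """Press's Q = (N - n*K)^2 / (N * (K - 1))."""
    return (n_total - n_correct * n_groups) ** 2 / (n_total * (n_groups - 1))


def chi2_upper_tail_1df(q):
    """P(chi-square with 1 df > q), via the complementary error function."""
    return math.erfc(math.sqrt(q / 2.0))


# Hypothetical counts: 197 of 309 holdout documents classified correctly.
q = press_q(309, 197)
p_value = chi2_upper_tail_1df(q)
print(f"hit ratio = {197 / 309:.2%}, Q = {q:.2f}, p = {p_value:.2g}")
print("significant at 10%:", p_value < 0.10)
```

With one degree of freedom the 10% critical value is about 2.71, so a hit ratio of exactly 50% (Q = 0) is never significant. Q is symmetric around 50%, so a large Q paired with a hit ratio below 50% indicates classification worse than chance, not better.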

Research question three is used to determine if there is a noticeable difference in
the performance of the two classification mechanisms. Based on the classification results
for the training set and the holdout sample, there is not a noticeable difference in the
performance of the two classification mechanisms.

Finally, the fourth research question addresses whether adding an independent
variable, x, to the set of variables that can potentially enter the linear discriminant model
via the stepwise discriminant procedure improves the prediction accuracy in the holdout
sample. The variable x is calculated based on the stock's classification in the three days
prior to the current classification date. For both classification mechanisms, the variable
enters the linear discriminant model for most of the stocks. For return classification,
the variable enters the model for 82 out of 93 stocks. For volume classification, the
variable enters the model for 91 out of 93 stocks. However, the average step at which the
variable enters the model is much earlier for volume classification, 4.12, than for
return classification, 41.48. Out of the 82 stocks with x entering the model in the
stepwise discriminant step, 15 have holdout samples with classification matrices that are
significant at 10% using Press's Q. For the same set of 82 stocks, 14 have holdout
samples with classification matrices that are significant without adding the variable to
the set of independent variables. There is not a significant difference in the prediction
accuracy when adding x to the model with return classification. Out of the 91 stocks
with x entering the model in the stepwise discriminant step, 32 have holdout samples
with classification matrices that are significant at 10% using Press's Q. For the same set
of 91 stocks, 16 have holdout samples with classification matrices that are significant
without adding the variable x to the set of independent variables. The p-value for the
difference in the two proportions, 32 out of 91 versus 16 out of 91, is 0.0036. There is a
very significant difference in the prediction accuracy when adding x to the model with
volume classification.
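
The comparison of the two proportions can be checked with a short calculation. The sketch below assumes a pooled two-proportion z-test with a one-sided alternative; the text does not name the test, but this choice gives a p-value of about 0.0036 for 32 versus 16 out of 91. The function name is illustrative.

```python
import math


def pooled_two_proportion_test(successes1, successes2, n):
    """One-sided pooled z-test for two proportions with equal sample size n."""
    p1, p2 = successes1 / n, successes2 / n
    pooled = (successes1 + successes2) / (2 * n)
    std_error = math.sqrt(pooled * (1.0 - pooled) * (2.0 / n))
    z = (p1 - p2) / std_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return z, p_value


# Volume classification: 32 versus 16 significant stocks out of 91.
z_volume, p_volume = pooled_two_proportion_test(32, 16, 91)
# Return classification: 15 versus 14 significant stocks out of 82.
z_return, p_return = pooled_two_proportion_test(15, 14, 82)
print(f"volume: z = {z_volume:.2f}, p = {p_volume:.4f}")
print(f"return: z = {z_return:.2f}, p = {p_return:.4f}")
```

The same stocks are evaluated with and without the variable, so the two samples are not independent; a paired test such as McNemar's would be an alternative, but it needs the per-stock outcomes, which are not restated here.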

6.2  Discussion 

In this study the relationship between news and stock returns and between news
and trading volume is examined. The discriminant function computed using the terms
appearing in news articles in the training set as the independent variables and
classification based on stock returns as the dependent variable in the discriminant
analysis procedure has predictive capability for the holdout sample in 16 out of 93 stocks.
Based on this predictive capability for these 16 stocks, a profitable daily trading strategy
for these stocks could be implemented. The ability to develop such a strategy contradicts
the efficient market hypothesis (Fama 1970). Additionally, the predictive capability of
the results shows a link between news and stock returns. The addition of an independent
variable that represents the stock's classification based on return for the three days prior
to classification does not provide a significant increase in the number of stocks in the
holdout sample with classification matrices that are significant. However, the set of
significant stocks changed considerably, with 10 different stocks appearing in the set of
stocks with significant holdout classification matrices.

The discriminant function computed using the terms appearing in news articles in
the training set as the independent variables and classification based on daily changes in
trading volume as the dependent variable in the discriminant analysis procedure has
predictive capability for the holdout sample in 16 out of 93 stocks. The predictive
capability for the 16 stocks as well as the training set classification accuracy for all 93
stocks illustrates that there is a link between news and subsequent trading volume. This
link between news and subsequent trading volume is consistent with financial literature.
In this study, classification is based on daily changes in trading volume, as opposed to
levels of trading activity or turnover ratio, the measurement typically used in financial
studies. Based on the results of this study, an interesting application would be to
determine which terms signal a decrease in trading activity and which signal an increase.
The addition of an independent variable that represents the stock's classification based on
trading volume for the three days prior to classification does provide a significant
increase in the number of stocks in the holdout sample with classification matrices
that are significant, to 32. The number of stocks that include the new independent variable
in the linear discriminant model via the stepwise linear discriminant procedure is very
high at 91 out of 93 stocks.

In summary, the addition of an independent variable that represents prior
performance of the stock results in a higher percentage of stocks with the new independent
variable entering in the stepwise procedure for volume classification than for return
classification. The new variable also enters at a significantly lower average step
in the stepwise procedure for volume classification than for return
classification. Finally, the new variable results in a higher number of stocks with
significant holdout classification matrices for volume classification than for return
classification. These differences between volume and return classification indicate that
previous volume activity is a more significant aspect of the linear discriminant model
than previous stock returns.

6.3  Conclusion 

In conclusion, the environmental scanning methodology developed in this
dissertation is automated, well founded and useful to an organization. The methodology
is validated empirically. The training set has excellent classification results, with 100%
of the stocks' training classification matrices and leave-one-out classification matrices
having statistical significance using either volume or return as the classification
mechanism. The predictive capability of the linear discriminant function, calculated
using the training set, shows great promise with 17% of the stocks having holdout
classification matrices with statistical significance. Adding the independent variable x
to the set of potential variables to enter the linear discriminant model via the stepwise
discriminant procedure improves the percentage of stocks with statistical significance in
the holdout classification matrices to 35% when classification is based on trading
volume. Finally, the methodology has great versatility as shown by the ease of
incorporating a variety of independent variables, the capability of handling large
document collections with a large number of terms and the adaptability to a variety of
applications.

6.4  Future  Directions 

Based on the environmental scanning methodology, there are several directions
in which this research could go. The volume classification mechanism can be changed
to classify according to level of trading volume or trading turnover, defined as the
number of shares traded divided by the number of shares outstanding, as opposed to
change in trading volume as compared to the previous day. The impact of a variable that
represents the nature and reliability of the source of a news article can be investigated. A
trading strategy can be developed and tested based on the stocks that show prediction
capability in the holdout sample with return classification. Linear programming
approaches to the discriminant analysis problem should be investigated, as they are not
hindered by violations of the assumptions made in the Fisher (1936) approach.
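
The difference between the current volume classification and the turnover-based alternative can be sketched as follows. This is only an illustration: the volumes, the number of shares outstanding, the turnover threshold and the treatment of ties are all hypothetical.

```python
def turnover(shares_traded, shares_outstanding):
    """Trading turnover: shares traded divided by shares outstanding."""
    return shares_traded / shares_outstanding


def volume_change_classes(daily_volumes):
    """1 if volume rose from the previous day, else 0 (ties count as 0 here)."""
    return [1 if today > previous else 0
            for previous, today in zip(daily_volumes, daily_volumes[1:])]


def turnover_classes(daily_volumes, shares_outstanding, threshold):
    """1 if the day's turnover exceeds a chosen threshold, else 0."""
    return [1 if turnover(v, shares_outstanding) > threshold else 0
            for v in daily_volumes]


# Hypothetical week of trading for a stock with 100 million shares outstanding.
volumes = [1_200_000, 1_500_000, 900_000, 900_000, 2_000_000]
print(volume_change_classes(volumes))                 # change versus previous day
print(turnover_classes(volumes, 100_000_000, 0.01))   # level of turnover
```

The two schemes can disagree on the same day: a day with high but falling volume is a "decrease" under the first scheme and "high turnover" under the second.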

Additionally, the environmental scanning methodology developed in this
dissertation can be used to analyze other areas. One additional application area involves
monitoring the content of stock chat rooms or message boards with the same
classification mechanisms discussed in this dissertation. Non-financial applications are
easily imagined too. For example, chat room discussions, critic reviews and press
releases for movies are documents that can be classified according to the success of a
movie. As new movies are released, their success can be predicted based on the content
of the aforementioned documents.


APPENDIX  A 
EXAMPLE 

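Let D be a collection of three documents $d_1$, $d_2$ and $d_3$. The documents in the
space D are classified according to three terms $t_1$, $t_2$ and $t_3$. Let $t_1$, $t_2$,
$t_3$, $d_1$, $d_2$, $d_3$ and D be defined as follows.

\[
t_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad
t_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \text{and} \quad
t_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
\]

\[
D = \begin{bmatrix} 5 & 0 & 2 \\ 1 & 3 & 3 \\ 4 & 0 & 7 \end{bmatrix}
\]

\[
\begin{aligned}
d_1 &= 5t_1 + 2t_3 \\
d_2 &= t_1 + 3t_2 + 3t_3 \\
d_3 &= 4t_1 + 7t_3
\end{aligned}
\]

Consider a query $q_1 = (0 \;\; 1 \;\; 2)$ or equivalently $q_1 = t_2 + 2t_3$; then

\[
\begin{aligned}
d_1 q_1 &= (5t_1 + 2t_3)(t_2 + 2t_3) \\
        &= 5t_1t_2 + 2t_2t_3 + 10t_1t_3 + 4t_3t_3 \\
        &= (5 \cdot 0) + (2 \cdot 0) + (10 \cdot 0) + (4 \cdot 1) \\
        &= 4
\end{aligned}
\]

\[
\begin{aligned}
d_2 q_1 &= (t_1 + 3t_2 + 3t_3)(t_2 + 2t_3) \\
        &= t_1t_2 + 2t_1t_3 + 3t_2t_2 + 9t_2t_3 + 6t_3t_3 \\
        &= (1 \cdot 0) + (2 \cdot 0) + (3 \cdot 1) + (9 \cdot 0) + (6 \cdot 1) \\
        &= 9
\end{aligned}
\]

\[
\begin{aligned}
d_3 q_1 &= (4t_1 + 7t_3)(t_2 + 2t_3) \\
        &= 4t_1t_2 + 8t_1t_3 + 7t_2t_3 + 14t_3t_3 \\
        &= (4 \cdot 0) + (8 \cdot 0) + (7 \cdot 0) + (14 \cdot 1) \\
        &= 14
\end{aligned}
\]

Hence, $d_3$ has the highest similarity to $q_1$.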
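
The same inner-product similarities can be checked with a few lines of code. The sketch below uses plain Python lists; the variable and function names are illustrative only.

```python
# Rows of D are the documents d1, d2, d3; columns are the terms t1, t2, t3.
D = [
    [5, 0, 2],
    [1, 3, 3],
    [4, 0, 7],
]
q1 = [0, 1, 2]  # query q1 = t2 + 2*t3


def dot(u, v):
    """Inner product of two term vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


scores = [dot(d, q1) for d in D]
best = max(range(len(D)), key=lambda i: scores[i])
print(scores)                        # similarity of d1, d2, d3 to q1
print(f"most similar: d{best + 1}")
```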


APPENDIX  B 
WEB  SITES,  STOCK  SYMBOLS  AND  STOPWORDS 


Table 18. Financial web sites

URL of Web Site              Description
http://smartmoney.com        SmartMoney financial web site
http://finance.yahoo.com     Yahoo financial web site
http://morningstar.com       Morningstar financial web site
http://tradetrek.com         Tradetrek financial web site

Table 19. Stocks

Stock Symbol   Name
AAPL           Apple Computer Inc
ABS            Albertson's Inc
ACNAF          Air Canada
ADSX           Applied Digital Solutions Inc
AEOS           American Eagle Outfitters Inc
AET            Aetna Inc
AFFX           Affymetrix Inc
AIMM           Autoimmune Inc
ALO            Alpharma Inc
ALR            Allied Research Corp
ANF            Abercrombie & Fitch Co Retail (Apparel)
AOL            AOL Time Warner Inc
ARNA           Arena Pharmaceutical Inc
ASCA           Ameristar Casinos Inc
ASF            Administaff Inc
ASKJ           Ask Jeeves Inc
ATAC           Aftermarket Technology Corp
AVIR           Aviron
AXP            American Express Co
BA             The Boeing Co
BAB            British Airways PLC
BAS            [illegible]
BC             Brunswick Corp Recreational Products
BDK            Black & Decker Corp
BMY            Bristol-Myers Squibb Co Major Drugs
[illegible]    Benetton Group SpA
BOSA           Boston Acoustics Inc
BTU            Peabody Energy Corp
BUD            Anheuser-Busch Companies Inc
[illegible]    Benetton Group SpA
BVF            Biovail Corp
[illegible]    Citigroup
CBIZ           Century Business Services Inc
[illegible]    Cabot Microelectronics Corp
[illegible]    Clear Channel Communications Inc
[illegible]    Churchill Downs Inc
CHIR           Chiron Corp
CMTN           Copper Mountain Networks Inc
[illegible]    Coca-Cola Bottling Co Consolidated
COX            Cox Communications Inc
[illegible]    Cree Inc
CRV            Coast Distribution System (DE)
[illegible]    Cisco
[illegible]    Centillium Communications Inc
CVC            Cablevision Systems Corp
CYBR           Cyber-Care Inc.
DAL            Delta Air Lines Inc
DD             E.I. Du Pont De Nemours & Co (DuPont)
DELL           Dell Computer Corp
[illegible]    Dendreon Corp
DOW            Dow Chemical Co
DPMI           Dupont Photomasks Inc
DRI            Darden Restaurants Inc
EBAY           eBay Inc
ELY            Callaway Golf Co
ENDP           Endo Pharmaceuticals Holdings Inc Major Drugs
[illegible]    Ericsson, Telefonab. L M AB
ERJ            Embraer SA
EW             Edwards Lifesciences Corp
F              Ford Motor
FA             Fairchild Corp
FDX            Fedex Corp (Federal Express)
FOSL           Fossil Inc
FPL            Florida Power & Light Co
FRK            Florida Rock Industries Inc
FRX            Forest Laboratories Inc
GD             General Dynamics Corp
GE             General Electric Comp.
[illegible]    Guess ? Inc
GLCBY          Globo Cabo SA
GLK            Great Lakes Chemical Corp
GLW            Corning Inc
GM             General Motors Corp
[illegible]    OshKosh B'Gosh Inc
GPS            Gap Inc
GR             Goodrich Corp
GTW            Gateway Inc
HDI            Harley Davidson
HELE           Helen of Troy Corp
HGGR           Haggar Corp
HLYW           Hollywood Entertainment
HOLL           Hollywood Media Corp
HTVN           Hispanic Television Network Inc
[illegible]    Humana Inc
HYDL           Hydril Co.
IBM            International Business Machines Corp
ILXO           ILEX Oncology Inc
INTC           Intel Corp
JNJ            Johnson & Johnson Inc Major Drugs
[illegible]    Juniper Group Inc
KELYA          Kelly Services Inc
KM             K Mart Corp
KNBWY          Kirin Brewery Co Ltd
KR             Kroger Co
LIZ            Liz Claiborne Inc
LLTC           Linear Technology Corp
LLY            Eli Lilly and Co Major Drugs
[illegible]    AT&T Corp
LMT            Lockheed Martin Corp
LTRE           Learning Tree
LUV            Southwest Airlines Inc
LZB            La-Z-Boy Inc
MCD            McDonald's Corp
MCO            Moody's Corp
MO             Philip Morris Companies Inc
MON            Monsanto Co
MOND           Robert Mondavi Corp
MOT            Motorola Inc
MOV            Movado Group Inc Jewelry & Silverware
MPPP           MP3.com Inc Recreational Products
[illegible]    Merck & Co Inc Major Drugs
MSFT           Microsoft
[illegible]    Emerson Radio Corp
MTIC           MTI Technology Corp
[illegible]    MTI Technology Corp.
MUEI           Micron Electronics Inc
MVL            Marvel Enterprises, Inc
MYG            Maytag Corp
NKE            Nike Inc
NOK            Nokia
NSANY          Nissan Motor Co Ltd
[illegible]    Norsat Intl Inc.
NVS            Novartis AG Major Drugs
NXTL           Nextel Communications
[illegible]    Oneida Ltd Jewelry & Silverware
[illegible]    OrthAlliance Inc
[illegible]    Oracle Corp.
[illegible]    Outback Steakhouse Inc
[illegible]    Panamerican Beverages Inc
[illegible]    The Pepsi Bottling Group Inc
PD             Phelps Dodge Corp Metal Mining
[illegible]    Pegasus Solutions Inc
PEP            Pepsico Inc
PERY           Perry Ellis International Inc
PHA            Pharmacia Corp
PIK            Water Pik Technologies Inc
PNW            Pinnacle West Capital Corp Electric Utilities
PNY            Piedmont Natural Gas Inc
POM            Potomac Electric Power Co
RBK            Reebok International Ltd Footwear
[illegible]    RG Barry Corp Footwear
RL             Polo Ralph Lauren Corp
ROFO           Rockford Corp
ROST           Ross Stores Inc
RS             Reliance Steel And Aluminum Co
RST            Boca Resorts Inc.
RVWD           Ravenswood Winery Inc
S              Sears, Roebuck and Co
SABI           Swiss Army Brands Inc
[illegible]    Samsonite Corp
SBTV           SBS Broadcasting SA
[illegible]    Shuffle Master Inc
SHW            Sherwin-Williams Co
[illegible]    Hollywood Entertainment
SLVN           Sylvan Learning Systems Inc
[illegible]    Sony Corp
SO             Southern Company Inc Electric Utilities
SOI            Solutia Inc
[illegible]    Spartan Motors Inc
SWC            Stillwater Mining Co
TBL            Timberland Co
TGT            Target Corp
TM             Toyota Motor Comp.
TMPW           TMP Worldwide Inc
TOM            Tommy Hilfiger Corp
[illegible]    LendingTree Inc
TUP            Tupperware Corporation
TXN            Texas Instruments Incorporated
TXU            TXU Corp
UBET           Youbet.com Inc
USAI           USA Networks Inc
USON           US Oncology Inc
VNGD           Vanguard Airlines Inc
[illegible]    Wisconsin Energy Corp Electric Utilities
WEN            Wendy's International Inc
WHR            Whirlpool Corp
WIN            Winn-Dixie Stores Inc
WLDA           World Airways Inc
WMT            Wal-Mart Stores Inc
WOR            Worthington Industries Inc
WPPGY          WPP Group PLC
WS             Weirton Steel Corp Iron & Steel
X              USX-US Steel Group Iron & Steel
XEL            Xcel Energy Inc
XOM            Exxon Mobil Corp.
YSTM           YouthStream Media Networks Inc
ZOOX           Gadzoox Networks Inc
Table 20. Stopword list

Stopword List

a             ends          knew          per           thoughts
about         enough        know          perhaps       three
above         etc           known         place         through
according     even          knows         places        throughout
across        evenly        l             point         thru
actually      ever          large         pointed       thus
adj           every         largely       pointing      to
after         everybody     last          points        today
afterwards    everyone      later         possible      together
again         everything    latest        present       too
against       everywhere    latter        presented     took
all           except        latterly      presenting    toward
almost        f             least         presents      towards
alone         face          less          problem       trillion
along         faces         let           problems      turn
already       fact          lets          put           turned
also          facts         let's         puts          turning
although      far           like          r             turns
always        felt          likely        rather        twenty
among         few           long          really        two
amongst       find          longer        recent        u
an            finds         longest       recently      under
and           fifty         ltd           right         unless
another       first         m             room          unlike
any           five          made          rooms         unlikely
anybody       for           make          s             until
anyhow        former        making        said          up
anyone        formerly      makes         same          upon
anything      forty         man           saw           us
anywhere      found         many          say           use
are           four          may           says          used
areas         from          maybe         second        uses
area          full          me            seconds       using
aren't        fully         meantime      see           v
around        further       meanwhile     seem          very
as            furthered     member        seemed        via
ask           furthering    members       seeming       w
asked         furthers      men           seems         want
asking        g             might         sees          wanted
asks          gave          million       seven         wanting
at            general       miss          seventy       wants
away          generally     more          several       was
b             get           moreover      shall         wasn't
back          gets          most          she           way
backed        give          mostly        she'd         ways
backing       given         mr            she'll        we
backs         gives         mrs           she's         we'd
be            go            much          should        we'll
became        going         must          shouldn't     we're
because       good          my            show          we've
become        goods         myself        showed        well
becomes       got           n             showing       wells
becoming      great         namely        shows         went
been          greater       necessary     side          were
before        greatest      need          sides         weren't
beforehand    group         needed        since         what
began         grouped       needing       six           what'll
begin         grouping      needs         sixty         what's
beginning     groups        neither       small         what've
behind        h             never         smaller       whatever
being         had           nevertheless  smallest      when
beings        has           new           so            whence
below         hasn't        newer         some          whenever
beside        have          newest        somebody      where
besides       having        next          somehow       where's
best          haven't       non           someone       whereafter
better        having        nine          something     whereas
between       he            ninety        sometime      whereby
beyond        he'd          no            sometimes     wherein
big           he'll         nobody        somewhere     whereupon
billion       he's          none          state         wherever
both          hence         nonetheless   states        whether
buy           her           noone         still         which
but           here          nor           stop          while
by            here's        not           such          whither
c             hereafter     number        sure          who
came          hereby        numbered      t             who'd
can           herein        numbering     take          who'll
can't         hereupon      numbers       taken         who's
cannot        hers          nothing       taking        whoever
caption       herself       now           ten           whole
case          high          nowhere       than          whom
cases         higher        o             that          whomever
certain       highest       of            that'll       whose
certainly     him           off           that's        why
clear         himself       often         that've       will
clearly       his           old           the           with
co            how           older         their         within
co.           however       oldest        them          without
come          hundred       on            themselves    won't
could         i             once          then          work
couldnt       id            one           thence        worked
d             ill           one's         there         working
did           im            only          there'd       works
didn't        i've          onto          there'll      would
differ        ie            open          there're      wouldn't
different     if            opened        there's       x
differently   important     opening       there've      y
do            in            opens         thereafter    year
does          inc.          or            thereby       years
doesnt        indeed        order         therefore     yet
done          instead       ordered       therein       yes
dont          interest      ordering      thereupon     yet
down          interested    orders        these         you
downed        interesting   other         they          you'd
downing       interests     others        they'd        you'll
downs         into          otherwise     they'll       you're
during        is            our           they're       you've
e             isn't         ours          they've       young
each          it            ourselves     thing         younger
early         it's          out           things        youngest
eg            its           over          think         your
eight         itself        overall       thinks        yours
eighty        j             own           thirty        yourself
either        just          p             this          yourselves
else          k             part          those         z
elsewhere     keep          parted        though
end           keeps         parting       thousand
ended         kind          parts         thought

Source: Frakes and Baeza-Yates 1992
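
A stopword list of this kind is typically applied before the term counts are formed. The sketch below is a minimal illustration only: it uses a small subset of the stopword list, a simple regular-expression tokenizer and a made-up sentence, none of which are taken from the programs used in the dissertation.

```python
import re
from collections import Counter

# A small subset of the stopword list; the full list would be loaded the same way.
STOPWORDS = {"a", "about", "and", "in", "is", "its", "of", "said", "the", "to", "up"}


def term_counts(text, stopwords=STOPWORDS):
    """Lower-case the text, split it into tokens and count the non-stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


counts = term_counts("The company said its profit is up and the profit of the unit is up.")
print(counts)
```

The remaining counts ("company", "profit", "unit") are the raw term frequencies from which a document vector can be built.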